Unit 12 


Review 


Introduction 


This unit is designed to help you consolidate what you have learned in 
M140. It describes some extensions of ideas you have already met, as well 
as a small number of new statistical ideas. The aim is to draw together 
parts of the module that are closely linked, even though they may have 
been in different units. Each section will review one or two topics and 
contain a number of activities to help you refresh your knowledge of them. 


Section 1 reviews descriptive statistics and summary statistics, which 
were the focus of Units 1, 2 and 3 (Book 1). There is new material in 
Subsection 1.2, which introduces growth charts. 


Section 2 discusses the collection of data through surveys and in 
experiments, including clinical trials. The material is based mainly on 
Units 4, 10 and 11, with the addition of material on combining survey 
methods (Subsection 2.1). 


Section 3 reviews properties of probability given in Units 6 and 8. 
Also, in Subsection 3.2, the general form of the binomial distribution 
is introduced. (Unit 6 used a specialised form of this distribution.) 


Section 4 describes the principal steps in a hypothesis test (Unit 6) and 
considers the χ² test of independence in a contingency table (Unit 8). 


Section 5 reviews the properties of the normal distribution and 
examines hypothesis tests and confidence intervals for making 
inferences about the mean of a population or the difference between 
two population means (Units 7, 9 and 10). A two-sample t-test for 
populations with unequal variances is introduced. 


Section 6 concerns relationships between two variables and reviews 
regression and correlation, which were the main focus of Units 5 and 9. 


Section 7 uses Minitab to explore binomial distributions and perform 
two-sample t-tests. 


In planning your study, you should note that there is new (assessable) 
material in Subsection 1.2, a small part of Subsection 2.1, most of 
Subsection 3.2 and a large part of Subsection 5.2. 


Section 7 directs you to the Computer Book. You are also guided to the 
Computer Book at the end of Sections 3 and 5. It is better to do the work 
at those points in the text, although you can leave it until later if you 
prefer. 


1 Summarising data 


One reason for summarising data is to be able to report the data 
succinctly, perhaps quoting its median value or range in order to describe 
features of the data. As well as numeric summaries, figures such as a 
boxplot or stemplot can be used for this purpose and are very informative. 
Another important reason for summarising data, which you saw in later 
units, is that summary statistics are often all that are needed for 
performing hypothesis tests or calculating confidence intervals. In this 
context there is little choice as to how the data should be summarised. For 
example, the sample mean, the sample standard deviation and the sample 
size are the information required from the data in order to perform a 
one-sample z-test. 


In Subsections 1.1 and 1.2, numerical statistics for summarising data are 
described. In Subsection 1.3, we turn to graphical summaries. 

In Subsection 1.4, we focus on the Retail Prices Index (RPI) and other 
indexes for summarising data on prices and earnings. All these topics are 
primarily taught in Units 1 to 3. 


1.1 Common numerical summaries 


Numbers that are used to summarise data are referred to as summary 
statistics. Usually, the key quantities used to summarise a set of data are 
its median or mean, along with its interquartile range, standard deviation 
or variance. 


Median and mean 


The median is the middle item in a set of data (if the number of items 
in the batch is odd) or the average of the middle two items (if there 
is an even number of items in the batch). 


The mean is the average value in a set of data, given by: 


\[ \bar{x} = \frac{\sum x}{n}, \]


where n is the number of items in the set of data. 


If you have to give just one number to summarise a set of data, then the 
median or mean are the obvious choices — one gives the middle of the data 
and the other its average. The two are quite close if the data have a fairly 
symmetric distribution, but will differ more if the data show great 
skewness. When the data are highly skew, the median is often more 
representative of the data. 
Figure 1 Examples of reasonably symmetric and highly skew datasets 
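
The effect of skewness on the two measures can be seen in a short Python sketch. The two batches are hypothetical values invented for this illustration (they are not data from M140); the second batch is the first with its largest value replaced by a much larger one.

```python
from statistics import mean, median

# Hypothetical batches, invented for illustration only.
symmetric = [48, 49, 50, 51, 52]
skew = [48, 49, 50, 51, 152]   # same batch, but with one very large value

for name, batch in [("symmetric", symmetric), ("skew", skew)]:
    # The median depends only on the middle of the ordered batch;
    # the mean uses every value, so one extreme value can move it a long way.
    print(name, "median =", median(batch), "mean =", mean(batch))
```

In this sketch the median is the same for both batches, while the mean of the skew batch is pulled towards the large value.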


Activity 1 Time between elections 




Following the Fixed-term Parliaments Act 2011, general elections in the 
UK take place every five years. Before that, an election could be called at 
any point the Prime Minister wished. 


The following are the times (in months) between elections in the 
years 1970 to 2010: 


44, 7, 55, 49, 48, 58, 61, 49, 47, 60. 
(a) What is the median of these data? 
(b) What is the mean of these data? 


(c) Which observation is most responsible for the difference between the 
median and mean? Would you consider either the mean or median 
unrepresentative of these data? 


The median and mean each describe the location of a set of data. They 
give a value that, in some sense, the data are centred around. The 
interquartile range, standard deviation and variance are each a measure of 
the extent to which a set of data is spread out. 


Interquartile range 


The central half of a set of data lies between the lower quartile and 
the upper quartile. The distance between these quartiles is the 
interquartile range. 


In the ordered set of data, the lower quartile is at position (n + 1)/4 
and the upper quartile is at position 3(n + 1)/4. 
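
As a rough illustration, the position rule can be written as a short Python function. The function names and the two batches are hypothetical, and the sketch assumes that a fractional position is handled by interpolating linearly between the two neighbouring values (so position 2.25 is taken to be a quarter of the way from the 2nd value to the 3rd).

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in an ordered batch.

    Assumption: fractional positions are resolved by linear interpolation
    between the two neighbouring data values.
    """
    whole = int(position)            # whole part of the position
    fraction = position - whole      # fractional part, 0 <= fraction < 1
    lower_value = ordered[whole - 1]
    if fraction == 0:
        return lower_value
    upper_value = ordered[whole]
    return lower_value + fraction * (upper_value - lower_value)


def interquartile_range(batch):
    """Return (lower quartile, upper quartile, interquartile range)."""
    ordered = sorted(batch)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch needs at least three data values")
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3, q3 - q1


# Hypothetical batches, invented for illustration only.
odd_batch = [2, 4, 5, 7, 8, 10, 13]          # n = 7: positions 2 and 6
even_batch = [1, 3, 4, 6, 9, 11, 12, 15]     # n = 8: positions 2.25 and 6.75

print(interquartile_range(odd_batch))
print(interquartile_range(even_batch))
```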


Activity 2 Interquartile range examples 


The following two sets of data have each been ordered from lowest to 
highest. The first set contains 15 data values and the second contains 
20 data values. Determine the interquartile range of each set. 


(a) 47, 49, 56, 57, 58, 58, 63, 63, 63, 64, 64, 65, 66, 68, 73. 


(b) 13, 15, 16, 16, 17, 19, 20, 20, 20, 20, 21, 21, 21, 22, 23, 
24, 26, 27, 28, 29. 


Variance and standard deviation 


The variance is the sum of the squared differences between each data 
value and the sample mean, divided by n − 1 (where n is the sample 
size). 


The standard deviation is the square root of the variance. 


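Although, from its definition, the formula for the sample variance ($s^2$) is 

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}, \]

it is quicker to separately calculate $\sum x^2$ and $\sum x$ and apply the equivalent 
formula 

\[ s^2 = \frac{1}{n - 1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right). \]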
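
The agreement between the definition and the quicker formula can be checked numerically. The following Python sketch uses hypothetical data invented for the purpose; the function names are also illustrative.

```python
from math import isclose, sqrt


def variance_from_definition(data):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)


def variance_from_sums(data):
    """Equivalent formula using only the sum of x and the sum of x squared."""
    n = len(data)
    sum_x = sum(data)
    sum_x_squared = sum(x * x for x in data)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


# Hypothetical batch, invented for illustration only.
data = [4, 7, 8, 10, 11]

v1 = variance_from_definition(data)
v2 = variance_from_sums(data)
print("variance:", v1, v2)
print("standard deviation:", sqrt(v1))
```

Both functions give the same variance (apart from possible rounding in floating-point arithmetic), and the standard deviation is its square root.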


Activity 3 Lowestoft daily temperatures 


The following are the mean daily maximum temperatures (in °C) in 
Lowestoft for July, in the years from 2002 to 2010: 


20.5, 21.6, 19.6, 19.9, 23.7, 20.8, 21.1, 21.2, 23.6. 


Determine the variance and the standard deviation of these data. 


A disadvantage of the variance is that its scale is not the scale of the 
original data. For example, suppose the data are the times taken to 
perform a task and each of these times is typically within 15 seconds of 
2 minutes. Then the standard deviation might be, say, 10 seconds. This 
quantity is readily interpreted and can be related to the data values. In 
contrast, the variance would be 100 seconds², which cannot be related to 
the data values quite so easily. 


The reason the variance is important is that it has good mathematical 
properties. For instance, performing a two-sample z-test (Unit 7) requires 
ESE, the estimated standard error of $\bar{x}_A - \bar{x}_B$. This is calculated, not from 
the two standard deviations $s_A$ and $s_B$, but from the variances $s_A^2$ and $s_B^2$: 

\[ \mathrm{ESE} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}. \]
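
A minimal Python sketch of this calculation is given below. The function name and the summary statistics are hypothetical (they are not taken from any M140 dataset). For contrast, the sketch also computes the quantity obtained by simply adding the two separate standard errors, which is not the ESE.

```python
from math import sqrt


def estimated_standard_error(var_a, n_a, var_b, n_b):
    """ESE of the difference between two sample means, from the variances."""
    return sqrt(var_a / n_a + var_b / n_b)


# Hypothetical summary statistics, invented for illustration only.
s_a, n_a = 3.0, 36    # sample A: standard deviation and sample size
s_b, n_b = 4.0, 64    # sample B: standard deviation and sample size

ese = estimated_standard_error(s_a ** 2, n_a, s_b ** 2, n_b)

# Not the ESE: adding the two separate standard errors gives a larger value.
sum_of_standard_errors = s_a / sqrt(n_a) + s_b / sqrt(n_b)

print("ESE =", round(ese, 4))
print("sum of separate standard errors =", sum_of_standard_errors)
```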


When a set of data is to be summarised by giving one quantity to indicate 
its location and another to indicate its spread, we generally give either the 
mean and standard deviation, or the median and interquartile range. 
Quoting other pairings, such as the mean and interquartile range, is less 
common. 


The quantities used most commonly in statistical calculations (such as 
when forming confidence intervals or testing hypotheses) are means, 
variances and standard deviations. Even when testing whether a 
population median takes a specified value, we do not need the sample 
median. 


1.2 Numerical summaries used less frequently 


This subsection contains new material, not previously covered in 
M140. 


The largest and smallest values in a dataset, $E_U$ and $E_L$, are quite often 
recorded in conjunction with other summary statistics so as to give a fuller 
description of the data. (For example, they are included in five-figure 
summary tables.) Inspecting the values of $E_U$ and $E_L$ is a useful step in 
cleaning data prior to statistical analysis as unusual values can highlight 
major errors in a dataset, perhaps caused by typing errors or other errors 
in recording the data. 


In contrast, the range, $E_U - E_L$, is seldom of interest because it is heavily 
influenced by the odd high or low value, so is often a poor reflection of the 
typical spread in a dataset. Similarly, the mid-range, $(E_U + E_L)/2$, and 
mode are seldom used as summary statistics even though they could be 
used as measures of location: they are generally less representative of the 
centre of the data than the mean or median, so one of the latter is used 
instead. 


An informative way of giving a detailed summary of a large set of data is 
to identify some of its percentiles. As well as the median (50th percentile), 
lower quartile (25th percentile) and upper quartile (75th percentile), some 
of the deciles (10th, 20th, ..., 90th percentiles) might also be given. Also, 
in forming confidence intervals and prediction intervals, the 2.5th and 97.5th 
percentiles are important as they are the end-points of a 95% interval. 
While confidence intervals summarise the results of a statistical analysis, 
rather than simply summarising a set of data, they illustrate that the more 
extreme percentiles can be of interest. Hence, it is often useful to include 
percentiles of less than 10% and more than 90% in a detailed summary of 
data. 
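
As an illustration, percentiles of a batch can be obtained with the `quantiles` function in Python's standard `statistics` module. The batch below is hypothetical (simply the whole numbers 1 to 99). By default the function uses its 'exclusive' method, which, as far as we are aware, places each cut point at a position proportional to n + 1, in line with the (n + 1)/4 rule for quartiles in Subsection 1.1.

```python
from statistics import quantiles

# Hypothetical batch, invented for illustration: the whole numbers 1 to 99.
data = list(range(1, 100))

quartiles = quantiles(data, n=4)     # 25th, 50th and 75th percentiles
deciles = quantiles(data, n=10)      # 10th, 20th, ..., 90th percentiles
fortieths = quantiles(data, n=40)    # cut points every 2.5%
lower, upper = fortieths[0], fortieths[-1]   # 2.5th and 97.5th percentiles

print("Quartiles:", quartiles)
print("Deciles:", deciles)
print("2.5th and 97.5th percentiles:", lower, upper)
```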


To avoid a reader being swamped with numbers, the information from a 
large number of percentiles might be presented in a diagram. Figure 2 gives 
an example called a growth chart. It shows the distribution of weights in 
a very large sample of boys in the first year of life. The percentiles that are 
given are the 0.4th, 2nd, 9th, 25th, 50th, 75th, 91st, 98th and 99.6th. 


The type of chart in Figure 2 is given to mothers leaving hospital after 
giving birth. It should reassure the majority of new parents that their 
baby is growing normally, while hopefully ringing alarm bells when a 
baby’s weight is unusually high or low. Reading values from the graph 
shows, for instance, that at 28 weeks old: 


• 25% of boys are below 7.5kg (and 75% are above that weight). 
• 9% of boys are below 7kg. 
• 2% are above 10kg. 


[Growth chart: boys’ weight (kg), 0-1 year] 

Figure 2 Growth chart showing percentiles of boys’ weights in first year of life 


Activity 4 Percentiles from a growth chart 


Use Figure 2 to give the proportion of boys who weigh: 
(a) less than 4.5kg when 10 weeks old; 

(b) more than 10kg when 46 weeks old; 

(c) between 6kg and 7.5kg when 22 weeks old. 





You have now covered the material related to Screencast 1 for 
Unit 12 (see the M140 website). 


1.3 Graphical summaries 


Boxplots and stemplots are commonly used as graphical summaries of 
data. If there are no unusually large or small data, a boxplot gives 
precisely the same information as a five-figure summary, except that the 
boxplot does not give n, the sample size. 


Five-figure summary 


n batch size 


Details for drawing boxplots are 
given in Subsection 2.2 of Unit 3. 


Figure 3 A standard boxplot 


Activity 5 Five-figure summary and boxplot 


The following data were given in part (a) of Activity 2 (Subsection 1.1): 
47, 49, 56, 57, 58, 58, 63, 63, 63, 64, 64, 65, 66, 68, 73. 

(a) Produce a five-figure summary of the data. 

(b) Produce a boxplot of the data. 


(c) Explain whether the boxplot indicates marked skewness in the data. 


If there are very large or very small values (relative to the main body of 
data), then in a boxplot these are marked individually and the whiskers 
only extend as far as the lower adjacent value (for the left whisker) or the 
upper adjacent value (for the right whisker). 


A stemplot resembles a histogram that has been turned on its side. It 
shows the shape of the distribution and contains most (or all) of the 
information in the data. Examples of stemplots are given in Figure 1 
(Subsection 1.1). The following is a stemplot of the data in Activity 5. 


[Stemplot of the data in Activity 5] 

n = 15    4 | 7 represents 47 


From this stemplot we could give each value with full accuracy — no 
information is lost. In Activity 6, a little information is lost because the 
initial data are given to two decimal places, while the stemplot only gives 
one decimal place. 
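
To show how such a stemplot is put together, here is a minimal Python sketch that builds one from the Activity 5 data. The function name is hypothetical, and the sketch assumes non-negative whole-number data and a simple (unstretched) stemplot with one level per stem; the layout of the printed stemplot may differ.

```python
from collections import defaultdict


def stemplot_lines(data):
    """Lines of a simple stemplot: stem = tens digit(s), leaf = units digit.

    Assumes non-negative whole numbers and one level per stem.
    """
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    lines = []
    # Step through every stem from smallest to largest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines


# Data from Activity 5.
batch = [47, 49, 56, 57, 58, 58, 63, 63, 63, 64, 64, 65, 66, 68, 73]

lines = stemplot_lines(batch)
for line in lines:
    print(line)
print(f"n = {len(batch)}    4 | 7 represents 47")
```

Every data value can be read back from the stems and leaves, which is why no information is lost here.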


Activity 6 Olympic times for the 800 metres 


The following data give the times of 24 athletes in the semi-finals of the 
men’s 800-metre race at the 2012 Olympic Games. (A 25th runner was 
disqualified.) The times are given in seconds above 1 minute. For example, 
the fastest time of 1 minute 44.34 seconds is given below as 44.34 seconds. 
The data values have been ordered from fastest to slowest. 


Table 1 800-metre race times, in seconds above 1 minute 
44.34 44.35 44.51 44.54 44.63 44.74 44.87 44.93 
45.08 45.09 45.10 45.34 45.44 45.63 45.84 45.85 
46.14 46.19 46.29 46.66 47.52 48.18 48.98 53.46 
(Data source: Official website of the Olympic Movement) 


(a) Construct a stretched stemplot of the data, in which each whole 
second is split between two levels, and one outlier is listed separately. 


(b) Comment on the shape of the stemplot. 





In summarising data, the following is the order of priority. 


1. If data must be summarised by just one number, then a number 
that represents the location of the data should be given (usually 
the median or mean). 


2. If two numbers are to be used as the summary, then the second 
number should indicate the spread of the data (usually the 
interquartile range, standard deviation or variance). 


3. Additional information would describe the shape of the data, 
notably any skewness, and identify the largest and smallest data 
along with any numbers that are extreme relative to the main 
body of the data. 


Graphical summaries can convey a lot of information in an accessible 
way. 


1.4 Summarising changes in prices and earnings 


The Retail Prices Index (RPI) and Consumer Prices Index (CPI) are both 
used to summarise the overall change in the level of prices paid by people 
for the goods and services they buy. These price indexes are calculated in 
similar ways, and here we will focus on the RPI. Both use a very large 
‘basket of goods’ that is designed to reflect the pattern of spending in the 
UK. Figure 4 shows the make-up of the RPI basket in 2012. 


Figure 4 Structure of the RPI in 2012 (based on data from the Office for 
National Statistics) 


As can be seen, the RPI is divided into five broad groupings. The inner 
ring shows, for example, that the typical household spends about twice as 
much on the group ‘Food and catering’ as on ‘Personal expenditure’. The 
five groupings are divided into 14 more detailed subgroups, which are 
themselves divided into sections. 


Certain items within each section are priced. For instance, within the 
‘Food and catering’ group there is a ‘Bread’ section and the prices of 
representative items of bread (such as a large white sliced loaf and bread 
rolls) are monitored each month in a number of shops and supermarkets. 
For each item, its prices in the current month are compared with its prices 
in the previous January and a price ratio is calculated that fairly reflects 
how the price of the item has changed across the country. 


Weighted mean of price ratios 


The weighted mean of two or more numbers is: 

\[ \frac{\text{sum of } \{\text{number} \times \text{weight}\}}{\text{sum of weights}}. \]

So the weighted mean of two or more price ratios is: 

\[ \frac{\text{sum of } \{\text{price ratio} \times \text{weight}\}}{\text{sum of weights}}. \]


1. First, the price ratios of items within a subgroup are combined by 
taking their weighted mean — using weights that reflect expenditure 
patterns for the different items. These give a price ratio for each 
subgroup. 

2. Next, the price ratios of subgroups within a group are combined by 
taking their weighted mean, giving a price ratio for the group. 

3. Lastly, group price ratios are combined by taking their weighted mean, 
giving the all-item price ratio for that month. 


The weights are determined from survey data on people’s expenditures. 
They are set each January and used for a year. Calculating the all-item 
price ratio from the group price ratios is illustrated in Activity 7. 


Activity 7 All-item price ratio for 2013 


Group price ratios (r) for August 2013 relative to January 2013 are given 
in Table 2, where the weights (w) for 2013 are also given. Complete the 
last column of the table and show that the sum is 1021.086. Hence show 
that the all-item price ratio for August 2013 (relative to January 2013) is 
approximately 1.021. 


Table 2 Weights and price ratios for August 2013 


                                    Price ratio for August 2013   2013      Price ratio 
                                    relative to January 2013      weights   × weight 
Group                               (r)                           (w)       (r × w) 

Food and catering                   1.013                         163 
Alcohol and tobacco                 1.021                          91 
Housing and household expenditure   1.017                         419 
Personal expenditure                1.055                          83 
Travel and leisure                  1.022                         244 

(Source: Office for National Statistics) 
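
The arithmetic in Activity 7 can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, while the price ratios and weights are those given in Table 2.

```python
# Group price ratios (r) and 2013 weights (w) from Table 2.
groups = {
    "Food and catering": (1.013, 163),
    "Alcohol and tobacco": (1.021, 91),
    "Housing and household expenditure": (1.017, 419),
    "Personal expenditure": (1.055, 83),
    "Travel and leisure": (1.022, 244),
}

# Weighted mean = sum of {price ratio x weight} / sum of weights.
sum_rw = sum(r * w for r, w in groups.values())
sum_w = sum(w for _, w in groups.values())
all_item_ratio = sum_rw / sum_w

print("sum of r x w   =", round(sum_rw, 3))
print("sum of weights =", sum_w)
print("all-item ratio =", round(all_item_ratio, 3))
```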


The RPI in January 2013 was derived from the RPI in January 2012, 
which in turn was derived from the RPI in January 2011, and so on. Hence the RPI 
is a chained index. To give the chain a starting point, the RPI is set equal 
to 100 at a base date. The current base date for the RPI is January 1987. 


The RPI in any month is obtained through multiplying ‘that month’s price 
ratio relative to the previous January’ by ‘the RPI in the previous 
January’. For example: 


• The price ratio for January 2013 relative to January 2012 (the 
previous January) was 1.0328 and the RPI in January 2012 was 238.0. 
Hence the RPI in January 2013 was 1.0328 × 238.0 ≈ 245.8. 


• As the price ratio for August 2013 relative to January 2013 was 1.021 
(from Activity 7), the RPI in August 2013 was 1.021 × 245.8 ≈ 251.0. 
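
The chaining in these two examples can be sketched in Python as follows. The variable names are illustrative, and each RPI value is rounded to one decimal place before it is used in the next link of the chain, as in the examples above.

```python
# Figures quoted in the text.
rpi_jan_2012 = 238.0
ratio_jan_2013_vs_jan_2012 = 1.0328
ratio_aug_2013_vs_jan_2013 = 1.021

# RPI in a month = (price ratio relative to previous January)
#                  x (RPI in the previous January).
rpi_jan_2013 = round(ratio_jan_2013_vs_jan_2012 * rpi_jan_2012, 1)
rpi_aug_2013 = round(ratio_aug_2013_vs_jan_2013 * rpi_jan_2013, 1)

print("RPI, January 2013:", rpi_jan_2013)
print("RPI, August 2013: ", rpi_aug_2013)
```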


The change in a person’s annual expenditure seldom reflects change in 
prices, which is the change that the RPI aims to capture. Rather, the 
change in a person’s annual expenditure probably reflects a change in the 
person’s annual income or a change in their personal circumstances. It is 
for this reason that the RPI requires a notional basket of goods. In 
contrast, finding the change in a person’s annual earnings simply requires 
knowledge of their earnings. Moreover, information on a person’s earnings 
is usually carefully recorded — a by-product of the UK system of income 
tax. Hence, constructing an index of earnings does not face the same 
challenges that the RPI must overcome. There are, however, adjustments 
that surveys of earnings must make. For example, the Average Weekly 
Earnings (AWE) index seasonally adjusts figures to allow for the effect of 
changes in earnings that occur regularly at fixed times of the year. 
Another example is the Annual Survey of Hours and Earnings (ASHE), 
which only collects information on paid employees and must make 
adjustments for the self-employed and unemployed, amongst others. Some 
detail is given in Unit 3. 


2 Collecting data 


Throughout this module we have examined samples of data. Sometimes 
the purpose of a sample is to learn about the population from which it 
comes, so the sample needs to be representative of the whole population 
and will usually need to be diverse. Another reason for gathering sample 
data is to perform an experiment. Then, often, the items selected for the 
experiment should be as similar as possible, so that differences between 
treatments (say) are not obscured by random variation between items. 


Simple random sampling is the most common method of sampling, and 
many methods of testing hypotheses or forming confidence intervals 
assume that the data are a simple random sample from the population of 
interest. This is true of the sign test in Unit 6 where in one activity 
(Activity 29, Subsection 4.1) we have a random sample of 15 large schools 
from the East of England region, in another (Activity 34, Subsection 5.1) 
we have a random sample of secondary school academies from the East of 
England, and so forth. Similarly, random samples are used in one-sample 
z-tests (Unit 7, Activity 26 in Subsection 5.2 and Exercise 17 in Section 5, 
for example) and to form confidence intervals from z-tests (Unit 9, 
Activity 18 in Subsection 4.2). In order to compare the means of two 
populations it is common to take a simple random sample from each 
population, as illustrated in Unit 7 (Exercises 20 and 21, Section 6) and 
Unit 10 (Exercise 4, Section 3). 


In principle, a simple random sample should be picked by sampling at 
random from the target population. This requires a list of the target 
population (or a mechanism that can select members of the target 
population at random) and is often impractical or impossible. Often a 
sample is treated as a simple random sample because it was not selected in 
any special way. For example, in your experiment with mustard seeds in 
Unit 10, the seeds are treated as a random sample of mustard seeds even 
though you did not use random number tables to decide which shop to go 
to for the seeds, or which packet of seeds to buy when you were in the shop. 


In Subsection 2.1, we review survey methods that aim to find out about a 
population by examining just some of the items in the population, or 
questioning just some of the people in it. In Subsection 2.2, we discuss the 
collection of data for experiments. 


2.1 Survey methods 


When a large sample is to be gathered and no list of the population is 
available, then quota sampling is commonly employed so that 
characteristics of the population, such as age and gender, are reflected in 
the sample. For example, interviewers may be told to complete 
questionnaires with, say, thirty men aged 30-35, forty women aged 50-60, 
and so forth. This method might even be used when a list of the 
population is available, as it can be an economical method of gathering 
reasonably representative data. However, when a list of population 
members is available, other efficient survey methods can be employed. 
Also, given a suitable list, proper randomisation can be used to give a 
simple random sample. The following methods were discussed in Unit 4. 


Simple random sampling 


In simple random sampling, members are selected one at a time. At each 
selection, those members of the population who have not already been 
selected are each equally likely to be the one selected. Thus each selection 
is independent of earlier selections, except that no member of the 
population can be selected more than once. The selection might be based 
on numbers given by a random number table or, more commonly, by a 
computer’s randomisation procedure. In Unit 6, the module team used 
random numbers generated by Minitab to select a sample of schools from a 
list of schools in the East of England. 


Statisticians fall asleep faster by taking a random sample of sheep. 
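
As an illustration of a computer's randomisation procedure, the sketch below draws a simple random sample using Python's standard `random` module (this is not the Minitab procedure used by the module team). The population labels, the sample size and the seed are all hypothetical choices made for the example.

```python
import random

# Hypothetical population of 50 members, labelled 01 to 50.
population = [f"{label:02d}" for label in range(1, 51)]

# A fixed seed is used only so that the example can be reproduced.
rng = random.Random(2024)

# random.sample selects without replacement: no member can be chosen twice,
# and at each selection every remaining member is equally likely.
sample = rng.sample(population, k=10)

print(sorted(sample))
```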


Systematic random sampling 


Choosing a sample from a list of the target population is slightly easier 
using systematic random sampling rather than simple random sampling. 
If, say, one-seventh of the population is to be included in a systematic 
sample, then every seventh person in the list would be included in the 
sample, starting from one of the first seven people in the list, chosen at 
random. The random choice of starting point means that everyone in the 
population has an equal chance of being included in the sample. 


If the population is listed in an order such that similar items or people are 
grouped together, then systematic sampling may well produce a more 
representative cross-section of the population than would be obtained by 
simple random sampling. For example, in an alphabetical listing of people, 
a husband and wife may well be listed such that one is immediately after 
the other. Then they would not both be included in a systematic sample, 
but through chance they might both be included in a simple random 
sample — which would then over-represent their family. 
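
A systematic random sample of one-seventh of a list can be sketched in Python as follows. The function name, the list of 70 labels and the seed are hypothetical choices made for the example.

```python
import random


def systematic_sample(listing, interval, rng):
    """Take every `interval`-th member, from a random start among the first `interval`."""
    start = rng.randrange(interval)     # 0-based index of the random starting point
    return listing[start::interval]


# Hypothetical list of 70 people, labelled 1 to 70 in the order they are listed.
listing = list(range(1, 71))

rng = random.Random(7)                  # fixed seed so the example is reproducible
sample = systematic_sample(listing, interval=7, rng=rng)

print(sample)
```

Because selected members are always seven places apart, two people listed next to each other can never both appear in the sample.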


The following activity is designed to refresh your skills in using random 
number tables to choose samples for the above two types of survey. 


Activity 8 Random and systematic samples 


Table 3 gives the initials of 68 people who form the target population. The 
people are grouped in six sets (A, B, C, D, E and F) and have been 
labelled 01, 02, ..., 68. 


Table 3 A population divided into sets A-F 

Label  Initials  Set    Label  Initials  Set    Label  Initials  Set 

01     C.J.      A      24     R.D.M.    C      47     Z.G.      D 
02     R.A.B.    A      25     M.C.      C      48     Y.H.      D 
03     A.P.D.    A      26     E.L.      C      49     K.V.M.    E 
04     M.A.      A      27     T.P.H.    C      50     P.H.      E 
05     E.M.      A      28     I.L.I.M.  C      51     J.R.R.    E 
06     J.L.      A      29     J.S.      C      52     G.C.T.    E 
07     J.D.H.    A      30     C.G.      C      53     D.S.P.    E 
08     A.E.G.    A      31     T.R.F.    C      54     H.M.      E 
09     E.M.H.    B      32     C.C.T.    C      55     M.J.P.    E 
10     T.N.      B      33     D.J.S.    C      56     C.S.T.    F 
11     C.T.L.    B      34     S.G.C.    C      57     A.Y.K.    F 
12     B.W.S.    B      35     D.L.      C      58     A.H.S.    F 
13     J.A.R.    B      36     D.K.B.    C      59     D.B.M.    F 
14     J.S.R.    B      37     C.A.M.    C      60     M.A.T.    F 
15     S.L.L.    B      38     R.I.J.    C      61     A.N.D.    F 
16     W.N.R.    B      39     W.O.J.    C      62     J.R.H.    F 
17     A.C.D.    B      40     A.H.D.    D      63     R.J.C.    F 
18     P.J.G.    B      41     P.V.      D      64     G.T.W.    F 
19     W.W.S.    B      42     A.L.      D      65     G.K.S.    F 
20     H.T.      B      43     L.R.P.    D      66     E.D.      F 
21     G.B.Y.    B      44     D.A.F.    D      67     Y.S.H.    F 
22     D.W.      B      45     R.H.R.    D      68     M.B.      F 
23     M.S.      B      46     A.T.      D 


(a) Calculate the proportion of the population in each set. 


(b) Choose a simple random sample of size 17 from the population in 
Table 3 using the random number table in the appendix to Unit 4, 
starting at the beginning of row 30. 


How many people in the sample are from set A? How many from each 
of the other sets? In the sample, which sets are under-represented 
relative to their size? 


(c) Choose a systematic random sample of about a quarter of the 
population in Table 3. This time, take the first digit in row 10 in the 
range 1 to 4 as your random start. Analyse the sample with respect to 
‘set’ and comment on how representative it is of the population. 


(d) Explain whether you expected the sample obtained in (c) to represent 
the population better than the sample obtained in (b). 


Stratified sampling 


Sometimes a population divides naturally into separate categories/ 
subpopulations, and items in the same category are likely to be more 
similar with respect to a quantity of interest than items from different 
categories. Then the categories are strata, provided each member of the 
population falls in exactly one category and we know (before gathering 
sample data) which members are in each category. 


The approach in stratified sampling is to take a subsample from each 
stratum and combine the information the subsamples yield. If the quantity 
of interest is a numerical measurement on an interval scale (such as a 
length or weight, say), then the information from subsamples is combined 
as follows to yield an overall estimate of the population mean. 


1. Taking each stratum separately, the sample mean and variance of the 
data in its subsample are calculated. These are estimates of the mean 
and variance for that complete stratum. 


2. A weighted average of the strata means is calculated to obtain an 
estimate of the population mean. The weights are based on the 
number of data in each stratum. (Formulas for the weights and the 
standard error of the weighted average are outside the scope of M140.) 


If the quantity of interest is a proportion (the proportion of the population 
who prefer brand X to brand Y, say), then the proportion in each 
subsample is determined first — to give an estimate of the proportion in 
each stratum. A weighted average of these proportions is then calculated 
and taken as the overall estimate of the population proportion. 


Cluster sampling 


Cluster sampling is almost essential when data are to be gathered through 
face-to-face interviews and the population of interest is spread across a 
large geographical area or consists of a large number of locations. Many 
surveys require interviewers to call at people’s homes or workplaces and 
cluster sampling is a way of reducing the travel involved in conducting the 
survey. In cluster sampling: 


• The large geographical area is divided into small geographical areas, 
the clusters. (Each cluster might be an office block, or a street, say.) 


• A limited number of these clusters are selected, preferably at random 
from all the clusters. 


• Cluster samples are obtained by taking a random sample from each 
selected cluster. (For example, 20% of the people in each cluster might 
be questioned in the survey.) 


• The cluster samples are combined to form a sample from the 
population. 


In forming cluster samples, it is quite common to choose sample sizes so 
that the same proportion of each cluster is sampled. For example, the 
survey design might specify that 20% of the people in each selected cluster 
should be questioned in the survey. As long as the clusters that will be 
sampled are selected at random, each member of the target population will 
then have an equal chance of being included in the survey — which seems 
an attractive property. However, if the clusters in the population differ 
radically in size, a drawback of this approach is that the total sample size 
(and hence the cost of the sample) is not known until after the clusters to 
sample have been selected. There are variants of cluster sampling that are 
designed to avoid this problem. 
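
The drawback described above can be illustrated with a small Python sketch. Everything in it is hypothetical: the six clusters, their sizes, the choice of two clusters and the 20% sampling fraction are invented for the example, and the function names are illustrative.

```python
import random
from itertools import combinations

# Hypothetical clusters and the number of people in each.
cluster_sizes = {"A": 20, "B": 50, "C": 400, "D": 30, "E": 1000, "F": 100}
PERCENT_SAMPLED = 20    # the same proportion is sampled from every selected cluster


def subsample_size(cluster):
    """Number of people to question in a selected cluster."""
    return cluster_sizes[cluster] * PERCENT_SAMPLED // 100


def cluster_sample(number_of_clusters, rng):
    """Select clusters at random, then take a random sample within each one."""
    chosen = rng.sample(sorted(cluster_sizes), k=number_of_clusters)
    sample = []
    for cluster in chosen:
        members = range(1, cluster_sizes[cluster] + 1)   # members numbered 1, 2, ...
        picked = rng.sample(members, k=subsample_size(cluster))
        sample.extend((cluster, member) for member in picked)
    return chosen, sample


chosen, sample = cluster_sample(2, random.Random(0))
print("clusters chosen:", chosen, "total sample size:", len(sample))

# Total sample size for every possible pair of clusters.
possible_totals = sorted(
    sum(subsample_size(c) for c in pair) for pair in combinations(cluster_sizes, 2)
)
print("smallest and largest possible totals:", possible_totals[0], possible_totals[-1])
```

With cluster sizes as unequal as these, the total sample size depends heavily on which clusters happen to be selected.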


With stratified sampling, the strata subsamples must contain the same 
proportion of each stratum if each member of the target population is to 
have an equal chance of being sampled. Again, this is quite commonly 
done in practice, but there can be reasons for preferring alternatives. In 
particular, from previous surveys it may be known that certain strata 
display far greater variability than other strata. That is, the variance of 
the quantity of interest is known to be greater in particular strata. Those 
strata should then be sampled more heavily than strata in which the 
variability is less. 


Combining survey methods 
This is new material, not previously covered in M140. 


In practice, it is common to combine different survey methods in designing 
a survey. To give an example, suppose a sample of staff working in 
hospitals across the UK is required. If it is thought that regional variations 
are likely, then the following survey design might be appropriate. 


e Divide the UK into regions: say Scotland, Northern England, Wales, 
the Midlands and so on. Each region is a stratum and hospital staff 
from each stratum must be included in the sample. 


• Within each region there are a lot of hospitals and so, to reduce the
travelling a survey interviewer must do, each hospital might be treated 
as a cluster. Then, within each stratum, cluster sampling would be 
used to determine which hospitals the interviewer would visit. 


• Suppose the survey must question nurses, doctors and administrators.
To ensure a balanced sample is taken, these three categories of staff
might be treated as three strata, and a random sample taken from 
each. 
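
The three stages of this design can be sketched in Python as follows. Everything in the sampling frame is hypothetical (the number of hospitals per region, the staff labels and the sample sizes at each stage), so this is an illustration of the structure rather than a realistic survey plan.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: region -> hospital -> staff category -> staff labels.
categories = ["nurse", "doctor", "administrator"]
frame = {
    region: {
        f"{region} hospital {h}": {
            cat: [f"{region}-H{h}-{cat}-{i}" for i in range(1, 21)]
            for cat in categories
        }
        for h in range(1, 7)
    }
    for region in ["Scotland", "Northern England", "Wales", "Midlands"]
}

hospitals_per_region = 2   # clusters drawn within each region
staff_per_category = 3     # simple random sample within each staff category

sample = []
for region, hospitals in frame.items():      # every region (stratum) is included
    chosen = random.sample(sorted(hospitals), hospitals_per_region)  # cluster sampling
    for hospital in chosen:
        for cat in categories:               # staff categories treated as strata
            sample += random.sample(hospitals[hospital][cat], staff_per_category)

print(len(sample))
```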


The above survey design might seem unnecessarily complicated, but 
surveys can be expensive. Moreover, large surveys are often repeated at 
regular intervals, so efficient and effective design is important. In Unit 2, 
some information was given about the survey methods used to obtain data 
for the Retail Price Index (RPI). For the RPI, the UK is divided into 
twelve regions (strata) and shopping locations in each region are placed 
within size categories. 


• The size categories (another level of strata) are based on factors such
as the size of the shopping population, the number of shops, and the 
drive-time to the shopping centre. 


• Location selection takes place separately within each region, using a
form of systematic sampling within strata. The outlets in a selected 
location are listed and the commodities each outlet sells are coded. 
The outlets are sampled in such a way as to obtain a number of prices 
for each item in the large basket of goods on which the RPI is based. 
(Some items in the basket of goods are sampled centrally, rather than 
through this procedure, but only for about 140 items out of 700.) 


This procedure sounds complicated ... and the detail around the separate 
sampling stages is substantially more complicated! 


Activity 9 Sampling an urban area for interview 


The Social Services department of a large unitary authority wishes to 
investigate whether disabled people in its urban area are receiving 
sufficient support. They plan to carry out a sample survey and decide to 
use as a sampling frame a list prepared for the purpose of collecting the 
council tax. This list includes every house in the urban area with its 
address, the house’s band for council tax purposes (which is a measure of 
its estimated selling price), whether the house is owned privately or by the 
council, and the parish (the urban area is divided into 27 parishes) in 
which it is situated. The list contains about 60000 houses. 


The Social Services department wants to select a sample of about
3000 houses and will send an interviewer to each selected house. They will
ask if any disabled people live at the house and, if so, ask about the 
support they receive. 


(a) Giving reasons for your choices, describe how the Social Services 
department might use a procedure involving some forms of stratified 
sampling, cluster sampling and simple random sampling to select the 
sample of 3000 houses. (It is not necessary to use all the methods.) 


(b) State whether your procedure would be likely to increase or decrease 
the sampling error compared with that of a simple random sample of 
3000 houses. What is the main benefit of your proposal compared with 
simple random sampling? 


2.2 Collecting data in experiments


An experiment involves making specific observations under specific 
conditions in order to answer specific questions. There are many kinds of 
experiments, including: 


• exploratory (Baconian) experiments, which aim to answer questions
such as ‘What happens if ...?’


• measurement experiments


• hypothesis-testing experiments, which aim to test a specific hypothesis,
often one about the cause of a phenomenon.





A Bacon(ian) experiment 


In M140 we have concentrated on the third type of experiment. Most of 
the experiments have investigated the effect of a particular treatment of 
some sort on people, animals, plants or some kind of object. They include 
whether the roots of mustard seeds grow more in the light than in the 
dark; whether the weight gain on diet A differs from that on diet B; 
whether a new drug is more effective than a placebo; and which dose of 
drug gives the best combination of effectiveness and low risk of adverse 
side effects. The items or individuals from a population that are included 
in an experiment are referred to as the experimental units. The 
investigations in M140 often involved comparing experimental units that 
had been exposed to the experimental treatment (the experimental group) 
with other experimental units that had not been exposed to the treatment 
(the control group). Otherwise, the experiments typically involved 
experimental groups that had been exposed to one of two treatments. 


In designing experiments, one aim is to give fair comparison of the 
different treatments being examined. 


You have met various strategies that help achieve this aim. 


• Randomisation is the most important of these strategies. This
allocates treatments to experimental units by chance, so no treatment 
can be deliberately favoured by being tested on the more responsive 
units. Hence, for example, in the mustard seed experiment you tossed 
a coin to decide which of the two pots of seeds would grow in the dark 
(step 12 in Subsection 2.3 of Unit 10). 


• Apart from the characteristic being examined, the different treatments
are made to resemble each other as much as possible. Thus placebos 
are used in clinical trials (and could be used in other forms of trial) so 
that ‘treatment’ and ‘no treatment’ appear the same to patients. For a 
similar reason, in the mustard seed experiment, the group of seeds 
grown in the light were covered with a piece of clear plastic so that 
they experienced similar levels of humidity as the seeds grown under 
aluminium foil. 


• Double-blind trials are used so that knowledge of which treatment a
patient is receiving cannot influence a patient’s response or a 
clinician’s perception of that response. This again aids a fair 
comparison of treatments. 


Activity 10 Double-blind trials and placebos 


Very briefly explain which of the following statements about double-blind 
trials are true and which are false. 


A. In a double-blind clinical trial, as many as possible of those carrying 
out the trial, and of the patients receiving the treatment, must know which 
patients have received the active drug and which the placebo. 


B. In a clinical trial where all measurements being made are objective 
rather than subjective, it is not necessary to use double-blind procedures. 


C. In a clinical trial where all measurements being made are subjective 
rather than objective, it is not necessary to use double-blind procedures. 


D. In a double-blind trial, the doctors who are administering the placebo 
and the patients who are receiving it all know that it is the placebo, 
whereas the doctors who are administering the drug being tested and the 
patients who are receiving it do not know whether they are dealing with 
the drug or the placebo. 


E. As far as possible, all controlled clinical trials should be double-blind. 


F. In a double-blind trial, the patients should never be told of the 
possibility that they might receive a placebo. 


G. In a double-blind clinical trial, neither the patients nor the doctors 
know which patients have received the drug being tested and which the 
placebo. However, an appropriate independent person does have this 
information. 


Randomisation of treatments to experimental units will mean that
treatment allocation is impartial — no treatment is deliberately favoured. 
However, differences in the experimental units might favour one treatment 
over another and so introduce bias. So: 


Another aim in designing experiments is to remove or reduce 
potential sources of bias. 


Thus, for example: 


• In the mustard seed experiment the two pots were placed side by side
(step 13 of the experiment) so that the two pots were at a similar 
temperature. For the same reason, you swapped the positions of the 
two large containers each day (Subsection 2.4 of Unit 10). 


• The Clackmannanshire experiment examined three different methods
of teaching children to read. Teaching methods are applied to whole 
classes, rather than individual children, so care was taken in selecting 
the participating classes and schools to ensure that the children taught 
by each method were broadly comparable (Subsection 1.3, Unit 8). 


• Many experiments (notably many clinical trials) are
group-comparative trials in which people/patients are allocated to 
treatments at random, but with restrictions, so that the overall 
characteristics of each treatment group are similar for qualities that 
are thought to matter. If gender and age were thought to influence a 
person’s response to treatment, then care would be taken to ensure 
that the control group and treatment group (or treatment groups) 
contained similar male-to-female ratios and that each group had 
similar age profiles. 


Activity 11 Group-comparative trial of an arthritis drug 


A drug company wishes to carry out a clinical trial on a new drug that, it 
is hoped, will alleviate the symptoms of arthritis. A design is chosen in 
which 20 arthritis sufferers are allocated to the trial. Half of the patients 
are to receive the new drug for four weeks (the experimental group), and 
the other half are to receive an existing drug for the same period of
four weeks (the control group). The ten patients for the experimental
group are chosen as a simple random sample from the list of 20. The 
remaining ten patients form the control group. 


Once the allocation has been carried out, the research staff running the 
clinical trial discover to their dismay that all the patients allocated to the 
experimental group turn out to be women and all those allocated to the 
control group turn out to be men. 


(a) Explain what characteristics of this experiment make it a 
group-comparative trial. 


(b) Explain why the allocation of all the women to the experimental group 
and all the men to the control group might upset the validity of the 
clinical trial. 


(c) If the research staff were to abandon this trial and start again with 20 
new patients, how might they alter their allocation procedure to 
ensure that such a problem could not arise? 





Random variation will affect differences that are observed between 
treatments, so a third aim when designing an experiment is to try to 
reduce this variation. 


Quite commonly, an experiment is designed to yield pairs of closely related 
measurements: the difference between the measurements in each pair is 
calculated and these differences form the data that are used in testing 
hypotheses or forming confidence intervals. The differences typically have 
a much smaller variance than the original measurements. The following are 
instances from earlier units where differences between pairs were formed. 


• In Exercise 9 of Unit 6 (Section 5), the change in degree of depression
after taking methadone (before − after) was the data used for a
hypothesis test of whether methadone had any effect, rather than just 
the degree of depression after taking the drug. 


• Matched-pairs t-tests are based on the differences within pairs of
observations. A pair of measurements might be the weights of an 
object given by two different weighing machines (Activity 22 in 
Subsection 4.2 of Unit 10) or resting heart rate before and after an 
exercise program (Activity 26 in Section 6 of Unit 10), for example. 


• In clinical trials, matched pairs look at differences between pairs of
patients who are closely matched (twins ideally!). One treatment is 
given to one member of the pair and the other treatment to the other 
member. Often, though not always, the data will be analysed using a 
paired t-test. 


• In a crossover design, each person taking part in the trial is given both
treatments. Thus, each individual acts as their own control, thereby 
reducing variability. The differences between a person’s responses 
under the two treatments are the data analysed. 
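
The reduction in variance obtained by working with differences can be seen in a small simulation. The Python sketch below is only illustrative: the model (a person-specific baseline with a large spread, small measurement noise and a fixed treatment effect) and all of its numbers are assumptions, not data from the module.

```python
import random
import statistics

random.seed(140)  # fixed seed so the sketch is reproducible

# Hypothetical model: each person has their own baseline level (large spread),
# plus small measurement-to-measurement noise and a true treatment effect.
n_people = 1000
true_effect = -3.0
before, after = [], []
for _ in range(n_people):
    baseline = random.gauss(70, 10)                  # varies a lot between people
    before.append(baseline + random.gauss(0, 2))
    after.append(baseline + true_effect + random.gauss(0, 2))

differences = [b - a for b, a in zip(before, after)]

var_after = statistics.variance(after)
var_diff = statistics.variance(differences)
print(f"variance of 'after' measurements: {var_after:.1f}")
print(f"variance of differences:          {var_diff:.1f}")
```

Because the baseline is common to both measurements on a person, it cancels in the difference, leaving only the measurement noise.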


When designing an experiment, a common question is: How much data 
should be collected? Often the experimenter is hoping to minimise the 
amount of data he or she has to collect. However, the experimenter should 
also consider the amount of time that will be spent preparing for the 
experiment, analysing the data statistically and then writing reports or 
papers about the results. These activities will take a significant amount of 
time — several weeks or months if the work is to be reported as a project or 
published in a scientific journal. In such cases, gathering data is just a 
small part of the process — but the quantity of data that is gathered is 
often crucial to the value of an experiment. The chance of rejecting a null 
hypothesis that is false is dependent on the amount of data gathered. 
Hence: 


Good advice for when you are designing an experiment is that you 
should aim to gather as much data as you reasonably can. 
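
The link between the amount of data and the chance of rejecting a false null hypothesis can be sketched numerically. The Python function below gives the approximate chance that a two-sided one-sample z-test at the 5% level rejects the null hypothesis; the true effect and standard deviation are hypothetical values chosen for illustration, and the population standard deviation is assumed known.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Approximate chance that a two-sided one-sample z-test rejects the null.

    effect: true difference between the population mean and the null value
    sigma:  population standard deviation (assumed known)
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)          # about 1.96 for alpha = 0.05
    shift = effect * n ** 0.5 / sigma            # mean of the test statistic
    return (1 - z.cdf(critical - shift)) + z.cdf(-critical - shift)

# Hypothetical scenario: true effect of 2 units, standard deviation 10.
powers = {n: z_test_power(2, 10, n) for n in (10, 50, 100, 400)}
for n, power in powers.items():
    print(f"n = {n:3d}: chance of rejecting the null = {power:.2f}")
```

In this scenario a sample of 10 would rarely detect the effect, whereas a sample of 400 almost always would.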


Exercises on Section 2 





Exercise 1 Stratified sampling and cluster sampling 


(a) Consider again the population in Table 3 (Subsection 2.1). Suppose 
sets A and B can sensibly be considered as one stratum, sets C and D 
as a second stratum, and sets E and F as a third stratum. A stratified 
sample is required in which six people are selected from the first 
stratum, six from the second stratum, and five from the third stratum. 
(The third stratum is smaller than the others.) Start in row 36 of the 
random number table (Unit 4 appendix) and pick a stratified sample. 
Give the labels of the people in the sample. 


(b) Suppose that the sets in Table 3 are groups of people who are widely 
separated geographically, so that cluster sampling is the obvious 
survey method to use. Restricting your sample to just two of the sets, 
sample approximately one-third of the individuals in each cluster. 


Obtain the sample using the following procedure: 


• Label the sets A-F from 1 to 6. Using single random digits, and
starting at the beginning of row 95 of the random number table, 
select the two sets to be sampled. These sets are to be sampled in 
the order in which they are selected. 


• Determine the sizes of the samples to take from each cluster by
dividing each cluster size by 3 and rounding the results up to 
whole numbers. 


• To select individuals for the subsample from the first selected set,
use pairs of digits starting at row 72 of the random number table.
No person may be selected more than once. To select individuals
from the second selected set, continue from the point reached in
the random number table after selecting the first subsample, and
apply the same procedure again.


List the people chosen in the subsamples. 


Exercise 2 Survey of NHS trust employees 


A large NHS trust wishes to survey a sample of its 6000 employees about 
job satisfaction. In addition to each employee’s payroll number, the 
employee database also includes information about each employee’s age 
and grade. It is thought that both age and grade might relate to job 
satisfaction. 


(a) Describe how the procedures of stratified sampling and systematic 
sampling might together be used to choose a sample of size 500 that 
represents an appropriate balance of the workforce. 


(b) Identify two non-sampling errors that may arise in the survey. 





Exercise 3 Other trials of an arthritis drug 


Suppose that, as in Activity 11 (Subsection 2.2), a drug company wishes to 
carry out a clinical trial to compare the usefulness of two drugs (a new 
drug and an existing drug) for alleviating the symptoms of arthritis. 


(a) Describe briefly how researchers could use a crossover design in a 
clinical trial comparing the two drugs. 


(b) Describe briefly how researchers could use a matched-pairs design in a 
clinical trial comparing the two drugs. 


(c) Which of the two designs would you consider the more suitable? 
Would a group-comparative trial be better? 


3 Probability 


In this section, we first review some properties of probability that were 
given in Units 6 and 8. We then consider the binomial distribution and the 
probabilities that it gives. You used a specialised form of the binomial 
distribution to test hypotheses about the median in Unit 6, and here the 
general form of the distribution is described. 


3.1 Basic properties 


The data in Table 4 will be used to illustrate the basic properties of 
probability. It concerns academic statisticians working in UK universities 
in 2013. They are separated into five age-bands (under 30, 30-39, 40-49, 
50-59, and 60 or over) and the table gives the number of them in each job 
category and age-band. 


Table 4 Academic statisticians in UK universities


                   < 30   30-39   40-49   50-59   ≥ 60   Total
Research Fellow     102      82      31       5      7     227
Lecturer             13     150      59      11      6     239
Senior Lecturer       0      35      70      43      8     156
Professor             0       8      51      66     47     172
Total               115     275     211     125     68     794


‘Research fellow’ includes research assistants as well as research fellows, 
and ‘senior lecturer’ includes readers as well as senior lecturers. 


Definition of probability 
A simple definition of probability is to equate it to a proportion: 
Probability = Proportion. 


More precisely: 


Probability of an event 


Let E stand for the event of selecting a person or object with some 
particular property from a population using random sampling. Then 
the probability of E is given by 





$$P(E) = \frac{\text{Number in population with particular property}}{\text{Total number in population}}.$$


Example 1 Probability of picking a lecturer 


There are 239 lecturers in the population of 794 academic statisticians in 
Table 4. Thus, the proportion of lecturers in the population is 


$$\frac{239}{794} \approx 0.301.$$


Hence, if we pick a person at random from the academic statisticians, the 
probability that we pick a lecturer is 0.301. 
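
The calculation in Example 1 can be reproduced directly from the counts in Table 4. The following Python sketch is one way of doing so; the variable names are illustrative.

```python
# Counts from Table 4: each row lists the age-bands
# under 30, 30-39, 40-49, 50-59, 60 or over.
table4 = {
    "Research Fellow": [102, 82, 31, 5, 7],
    "Lecturer":        [13, 150, 59, 11, 6],
    "Senior Lecturer": [0, 35, 70, 43, 8],
    "Professor":       [0, 8, 51, 66, 47],
}

population = sum(sum(row) for row in table4.values())   # 794
lecturers = sum(table4["Lecturer"])                      # 239

p_lecturer = lecturers / population
print(round(p_lecturer, 3))   # 0.301
```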





Activity 12 Picking a research fellow 

Suppose an academic statistician is picked at random. What is the 
probability that the person is: 

(a) a research fellow? 

(b) aged 30-39? 

(c) a research fellow aged 30-39? 


Prominent statisticians from the University of Cambridge 
Statistical Laboratory, 1953 


The figure below shows staff and postgraduate students of the
Statistical Laboratory in 1953.





Some of the people shown were already very well known statisticians 
in 1953, and others became prominent statisticians later. You will 
hear of some of them if you study statistics further, including David 
Cox (now Sir David Cox, FRS, FBA), John Wishart (first Director of 
the Statistical Laboratory), Frank Anscombe and Dennis Lindley, 
who are respectively third to sixth from the left in the second row. 


Conditional probability 


Sometimes we want probabilities for subpopulations. For example, Smith 
(a statistician) has become a professor at the age of 37 and wants to know 
the probability that an academic statistician is a professor if they are in 
the 30-39 age-band. Thus academic statisticians aged 30-39 form the 
subpopulation of interest. 


From Table 4, there are 275 in this subpopulation, of whom 8 are 
professors. Hence among statisticians aged 30-39, the proportion who are 
professors is 
$$\frac{8}{275} \approx 0.029.$$

Thus


P(professor|aged 30-39) ≈ 0.029.


The conditional probability of A, given B, denoted P(A|B), is the 
probability that A occurs, given that B occurs. 
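
A conditional probability is simply a proportion within the subpopulation. The Python sketch below reproduces the calculation for professors among those aged 30-39, using exact fractions; it also checks that the same value is obtained by dividing the two whole-population proportions.

```python
from fractions import Fraction

# Counts from the 30-39 column of Table 4.
aged_30_39 = {"Research Fellow": 82, "Lecturer": 150, "Senior Lecturer": 35, "Professor": 8}

subpopulation = sum(aged_30_39.values())                      # 275
p_prof_given_30_39 = Fraction(aged_30_39["Professor"], subpopulation)

print(p_prof_given_30_39, round(float(p_prof_given_30_39), 3))   # 8/275 0.029

# Same answer from proportions of the whole population of 794.
population = 794
print(Fraction(8, population) / Fraction(subpopulation, population) == p_prof_given_30_39)  # True
```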


Hoping to avoid the difficulties of using conditional probability,
Thomas Jefferson writes the Declaration of Independence.


Activity 13 Senior lecturer aged 40-49 years old


Suppose an academic statistician is picked at random. 


(a) If the person is aged 40-49, what is the probability that they are a 
senior lecturer? 


(b) If the person is a senior lecturer, what is the probability that they are 
aged 40-49? 


Independence 


Independence is an important concept in statistics. Most of the hypothesis 
tests in M140 require observations to be independent. In particular, this is 
true of the sign test, the one-sample and two-sample z-tests, and t-tests for 
one sample or two unrelated samples. In such contexts, the everyday 
notion of independence corresponds to the statistical definition. We are 
assuming that the value taken by one (random) observation has no 
influence on the value taken by any other random observation. That is, as 
in standard English, two things are ‘independent’ if they are unrelated. 


More precisely, we define statistical independence in terms of probabilities. 


Two events A and B are statistically independent if the occurrence of 
one has no influence on the chance of occurrence of the other. Then 


P(A|B) = P(A).
[It can be shown that if P(A|B) = P(A), then P(B|A) = P(B).] 


Events that are unrelated are also statistically independent. For example, 
if you roll a die and toss a coin, the event ‘the die gives a three’ is both 
physically independent and statistically independent of the event ‘the coin 
lands heads’. However, statistical independence is a numerical property of 
probability, and events can be statistically independent without being 
physically disconnected. This is illustrated in the following example. 





Example 2 Rolling a die 


Suppose an ordinary six-sided die is rolled. Assuming it is unbiased, it is 
equally likely to roll a 1, 2, 3, 4, 5 or 6. Suppose it is rolled once and 
consider the following events, 


• A: it rolls an even number (i.e. 2, 4 or 6).

• B: it rolls a 4 or more (i.e. 4, 5 or 6).

• C: it rolls a 3 or more (i.e. 3, 4, 5 or 6).

Which of these events are independent? Well,

$$P(A) = \frac{3}{6} = \frac{1}{2},$$


as three of the six possibilities result in A occurring. 


Now if we know that B has occurred, then our population of possible 
outcomes reduces to 4, 5 or 6. Each of these three events is equally likely, 
and event A occurs if the roll was actually a 4 or 6 (but not if it was a 5). 
Hence, 


$$P(A \mid B) = \frac{2}{3}.$$


As P(A) does not equal P(A|B), the probability that A occurs is affected 
by whether or not B occurs. Thus events A and B are not independent. 


Suppose, instead, that we know that C has occurred. Then our population 
of possible outcomes is 3, 4, 5 or 6. Each of these four outcomes is equally 
likely, and event A occurs if the roll was a 4 or 6 (but not if it was a 3 or 
5). Hence,

$$P(A \mid C) = \frac{2}{4} = \frac{1}{2}.$$

As P(A) does equal P(A|C), the occurrence of C has no influence on the
probability that A occurs. Thus events A and C are independent, even
though the same physical quantity (the outcome of rolling a die)
determines them both.


Lastly, for events B and C, 


$$P(C) = \frac{4}{6} = \frac{2}{3},$$


as four of the six possibilities result in C occurring. Also, if we know that 
B has occurred, then our population of possible outcomes is 4, 5 or 6. 
Regardless of which of these outcomes has occurred, C will occur. That is, 
C is certain to occur if B occurs, so 


P(C|B) = 1.


As P(C) does not equal P(C|B), the probability that C occurs is affected 
by whether or not B occurs. Thus events B and C are not independent. 
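
The three comparisons in Example 2 can be checked by counting outcomes. The Python sketch below represents each event as a set of die faces; the helper functions `prob` and `independent` are illustrative names, not part of the module.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}          # equally likely faces of a fair die
A = {2, 4, 6}                          # even number
B = {4, 5, 6}                          # 4 or more
C = {3, 4, 5, 6}                       # 3 or more

def prob(event, given=outcomes):
    """P(event | given), counting equally likely outcomes."""
    return Fraction(len(event & given), len(given))

def independent(x, y):
    """True if knowing that y has occurred leaves the probability of x unchanged."""
    return prob(x, y) == prob(x)

print(prob(A), prob(A, B), prob(A, C))   # 1/2 2/3 1/2
print(independent(A, B), independent(A, C), independent(C, B))   # False True False
```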





Joint probabilities (the ‘and’ linkage) 


The joint probability of A and B, denoted P(A and B), is the probability 
that both A and B occur together. 


Joint probabilities 


Let A and B be any two events. Joint and conditional probabilities 
are linked by the following relationships: 


P(A and B) = P(A) x P(B|A) = P(B) x P(A|B).
Also, if A and B are independent events [so P(B|A) = P(B)], then
P(A and B) = P(A) x P(B).
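
These relationships can be checked numerically with counts from Table 4. In the Python sketch below, A is taken to be the event that a randomly picked academic statistician is a senior lecturer and B the event that they are aged 50-59; this choice of events is ours, made purely for illustration.

```python
from fractions import Fraction

# Counts from Table 4.
population = 794
senior_lecturers = 156        # row total
aged_50_59 = 125              # column total
both = 43                     # senior lecturers aged 50-59

p_a = Fraction(senior_lecturers, population)        # P(A): senior lecturer
p_b = Fraction(aged_50_59, population)              # P(B): aged 50-59
p_b_given_a = Fraction(both, senior_lecturers)      # P(B|A)
p_a_given_b = Fraction(both, aged_50_59)            # P(A|B)
p_a_and_b = Fraction(both, population)              # P(A and B)

print(p_a * p_b_given_a == p_a_and_b)   # True
print(p_b * p_a_given_b == p_a_and_b)   # True
print(p_a * p_b == p_a_and_b)           # False: A and B are not independent
```

The last line shows that the simpler product rule fails here, because job category and age-band are not independent.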


Activity 14 Picking a professor
Suppose an academic statistician is picked at random. 


(a) How many academic statisticians are professors aged 50-59? Hence 
show that 0.083 is the probability that the randomly picked person is a 
professor aged 50-59. 


(b) What is the probability that the randomly picked person is a professor? 


(c) If the randomly picked person is a professor, what is the probability 
that they are aged 50-59? 


(d) Define events A and B as follows. 


A: a randomly picked academic statistician is a professor


B: a randomly picked professor is aged 50-59. 
Use (a), (b) and (c) to show that 
P(A and B) = P(A) x P(B|A). 


Adding probabilities (the ‘or’ linkage) 


We say that the event ‘A or B’ occurs if (i) A occurs, or (ii) B occurs, or 
(iii) both A and B occur. The ‘or’ linkage leads to the addition of 
probabilities and has a simpler form when events are mutually exclusive. 


Two events are said to be mutually exclusive if they cannot occur at 
the same time. More generally, any number of events are said to be 
mutually exclusive if no two of them can occur at the same time. 


If events A and B are mutually exclusive, then P(A and B) = 0, as they 
cannot both occur at the same time. 


Addition rules for probabilities 

Let A and B denote two events, mutually exclusive or not. Then 
P(A or B) = P(A) + P(B) − P(A and B).

If A and B are mutually exclusive, 


P(A or B) = P(A) + P(B). 
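
Both forms of the addition rule can be illustrated with counts from Table 4. The events used in the Python sketch below (research fellow, aged under 30, lecturer, professor) are our own choice for illustration.

```python
from fractions import Fraction

# Counts from Table 4.
population = 794
research_fellows = 227
under_30 = 115
research_fellows_under_30 = 102
lecturers = 239
professors = 172

# Not mutually exclusive: a research fellow can also be under 30,
# so the overlap must be subtracted to avoid double-counting.
p_rf_or_under_30 = (Fraction(research_fellows, population)
                    + Fraction(under_30, population)
                    - Fraction(research_fellows_under_30, population))

# Mutually exclusive: nobody is counted as both a lecturer and a professor.
p_lecturer_or_professor = Fraction(lecturers, population) + Fraction(professors, population)

print(p_rf_or_under_30, p_lecturer_or_professor)
```

As a check on the first result, the table contains 227 research fellows plus 13 people under 30 who are not research fellows, giving 240 people out of 794.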


Activity 15 The great and the good


(a) Suppose a statistical organisation wishes to invite ‘the great and the 
good’ to an event it is hosting. Any UK academic statistician who is a 
professor or is aged at least 60 will be invited. Using Table 4, explain 
why 193 UK academic statisticians will be invited. How many people 
are double-counted if you simply add the number of professors to the 
number of academic statisticians aged 60 or over? 


(b) At a more select gathering, only older professors are invited (professors 
aged at least 60). How many people from UK universities are invited? 


(c) Define events A and B as follows. 
A: a randomly picked academic statistician is a professor 
B: a randomly picked academic statistician is aged 60 or over.


Use (a) and (b) to calculate the probabilities P(A or B) and 
P(A and B). 


Also calculate P(A) and P(B). Hence show that 
P(A or B) = P(A) + P(B) − P(A and B),

in line with theory.


We have focused on the case where there are exactly two events. The 
following box gives useful results for the case where there are four events, 
which can be easily generalised to other numbers of events. 


Sets of events 
If A, B, C and D are independent events, then 

P(A and B and C and D) = P(A) x P(B) x P(C) x P(D).
If A, B, C and D are mutually exclusive events, then 


P(A or B or C or D) = P(A) + P(B) + P(C) + P(D). 
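
As a brief illustration of both results, the Python sketch below uses two examples of our own: four separate rolls of a fair die (independent events) and the four job categories of Table 4 (mutually exclusive events).

```python
from fractions import Fraction
from math import prod

# Independent events: four separate rolls of a fair die each give 3 or more.
p_three_or_more = Fraction(4, 6)
p_all_four_rolls = prod([p_three_or_more] * 4)

# Mutually exclusive events: the four job categories in Table 4.
counts = [227, 239, 156, 172]
p_any_category = sum(Fraction(c, 794) for c in counts)

print(p_all_four_rolls, p_any_category)   # 16/81 1
```

The second probability is 1 because every academic statistician in the table falls into exactly one of the four categories.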


3.2 The binomial distribution 


This subsection contains new material, not previously covered in 
M140. 


You encountered the binomial distribution in Unit 6. It is an important 
distribution in statistics so here we consider it further. 


In Unit 6 (just before Activity 22 in Subsection 3.1), we determined that 
the number of ways in which a committee of 3 people can be chosen from 
10 members of a club is equal to 

$$\frac{10 \times 9 \times 8}{3 \times 2 \times 1}.$$


This led to the following result.


Number of combinations 
Suppose there are n objects to choose from. Then the number of ways 
of choosing x objects if the order does not matter is 
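$${}^{n}C_{x} = \frac{n \times (n-1) \times \cdots \times (n-x+1)}{x \times (x-1) \times \cdots \times 1}.$$

(There are x terms in both $n \times (n-1) \times \cdots \times (n-x+1)$ and
$x \times (x-1) \times \cdots \times 1$.)


For any value of n, ${}^{n}C_{0}$ and ${}^{n}C_{n}$ are defined as being equal to 1.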
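
The number of combinations can be computed exactly as the result describes, as a product of x terms divided by another product of x terms. The Python sketch below does this for the committee example and compares the answer with the standard library function `math.comb`; the function name `combinations` is illustrative.

```python
import math

def combinations(n, x):
    """Number of ways of choosing x objects from n when order does not matter."""
    numerator = math.prod(range(n, n - x, -1))     # n, n-1, ..., n-x+1 (x terms)
    denominator = math.prod(range(x, 0, -1))       # x, x-1, ..., 1 (x terms)
    return numerator // denominator

print(combinations(10, 3))                        # 120, the committee example
print(combinations(10, 3) == math.comb(10, 3))    # True
print(combinations(7, 0), combinations(7, 7))     # 1 1
```

An empty product is taken to be 1, which is why choosing none of the objects gives the value 1, matching the definition above.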


The result will be used to obtain the probabilities given by a binomial 
distribution. The following activity will refresh your memory of using it. 


Activity 16 Socks for the weekend


(a) You have six clean pairs of socks and must pack three pairs in your 
bag to go away for the weekend. In how many ways could you choose 
the three pairs to take? 


(b) A month later you are going away for a longer break and must pack 
four pairs of socks. If you have nine clean pairs, how many choices do 
you have? (Either you have got better at doing the washing or you 
have bought some more socks!) 





Suppose that a defence lawyer has a success rate of 0.3. That is, in 30% of 
the trials in which he is the defence lawyer the defendant is acquitted. Let 
S denote success (the defendant is acquitted) and F denote failure (the 
defendant is found guilty). Then 


P(S) = 0.3 and P(F) = 1 − 0.3 = 0.7.


A sequence of five trials might give the sequence SFFSS, where this
means that the first trial was a success for the lawyer, the next two were 
failures, and the last two were successes. If we assume that outcomes are 
independent of each other, then 


$$P(SFFSS) = 0.3 \times 0.7 \times 0.7 \times 0.3 \times 0.3 = 0.3^3 \times 0.7^2.$$


Other sequences that give 3 successes and 2 failures include SSSFF and
FSSSF, with probabilities

$$P(SSSFF) = 0.3 \times 0.3 \times 0.3 \times 0.7 \times 0.7 = 0.3^3 \times 0.7^2$$

and

$$P(FSSSF) = 0.7 \times 0.3 \times 0.3 \times 0.3 \times 0.7 = 0.3^3 \times 0.7^2.$$


Clearly, the probability of any sequence that gives 3 successes and
2 failures is $0.3^3 \times 0.7^2$. Suppose now that we want the probability that
exactly 3 of the next 5 trials will be a success (in any order). It follows
that this probability equals

(number of sequences of 5 trials giving 3 successes) $\times\ 0.3^3 \times 0.7^2$.

As ${}^{5}C_{3}$ is the number of sequences of 5 trials giving 3 successes,

$$\begin{aligned}
P(\text{3 successes in 5 trials}) &= {}^{5}C_{3} \times 0.3^3 \times 0.7^2 \\
&= \frac{5 \times 4 \times 3}{3 \times 2 \times 1} \times 0.027 \times 0.49 \\
&= 0.1323.
\end{aligned}$$
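
The argument can be checked by brute force: list every possible sequence of 5 trials, keep those with exactly 3 successes, and add up their probabilities. The following Python sketch does this for the defence lawyer example.

```python
from itertools import product

p_success, p_failure = 0.3, 0.7

# Every possible sequence of 5 trials, e.g. ('S', 'F', 'F', 'S', 'S').
sequences = list(product("SF", repeat=5))
three_successes = [s for s in sequences if s.count("S") == 3]

# Each such sequence has probability 0.3^3 x 0.7^2, so add them up.
probability = sum(p_success ** 3 * p_failure ** 2 for _ in three_successes)

print(len(three_successes))       # 10
print(round(probability, 4))      # 0.1323
```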


Activity 17 Sales targets 




A salesman has a daily target for the number of sales he should make in a 
day. Let S denote success (he reaches his target) and F denote failure. 
Assume that the probability of S on any day is 0.6 (so P(F) = 0.4) and 
that results on different days are independent of each other. 


(a) What is the probability that in the next 6 days his sequence of 
successes and failures is FSSSFS?


(b) How many different sequences of 6 days give 4 successes and 2 failures? 


(c) What is the probability that the salesman meets his target in exactly 4 
of the next 6 days? 


(d) What is the probability that the salesman meets his target in exactly 5 
of the next 8 days? 


The probabilities you calculated in parts (c) and (d) of Activity 17 are 
probabilities from a binomial distribution. More generally, the binomial 
distribution arises in the following situation. 


1. There is a fixed number of trials, where a trial is an event that can 
result in success (S) or failure (F). Let n denote the number of trials. 


2. The probability of success in a trial is the same for all trials and the 
result of any trial is independent of the results of other trials. Let p 
denote the probability of success and q = 1 − p denote the probability
of failure. 


3. Let x denote the number of successes in the n trials. Then the 
probability distribution of x is a binomial distribution. 


In the example with lawyers a ‘trial’ was actually a real trial. We were 
interested in the number of successes in 5 trials, so n equals 5, and the 
probability of success in any one trial was 0.3, so p = 0.3 and q = 0.7. 


In the example with the salesman in Activity 17, a ‘trial’ was a day’s sales 
performance and success equated to making the daily target. Thus p = 0.6 
was the probability of success and q = 0.4 was P(F). In part (c) of
Activity 17, we were concerned with a 6-day period, so n was 6, and we
wanted P(4 successes) = P(x = 4). In (d), n was 8 and we wanted
P(x = 5).
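

The following defines the binomial distribution.


If x is the number of successes in n independent trials, each with
probability of success p and probability of failure q = 1 − p, then

$$P(x) = {}^{n}C_{x}\, p^{x} q^{n-x}, \quad x = 0, 1, \ldots, n.$$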
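
A binomial probability is the number of sequences with x successes multiplied by the probability of any one such sequence, so it is straightforward to compute. The Python sketch below is a minimal illustration using the defence lawyer example (n = 5, p = 0.3); the function name is our own.

```python
from math import comb

def binomial_probability(x, n, p):
    """P(x successes in n independent trials, each with success probability p)."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# The defence lawyer example: n = 5 trials, p = 0.3.
distribution = [binomial_probability(x, 5, 0.3) for x in range(6)]

print(round(distribution[3], 4))        # 0.1323
print(round(sum(distribution), 10))     # 1.0
```

The probabilities for x = 0, 1, ..., 5 add up to 1, as they must, since exactly one of these numbers of successes will occur.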


Forty-four per cent of the population of the UK are blood group O. If 
seven people are picked at random, what is the probability that exactly 
two of them are blood group O? 


‘Success’ is the name 
of a suburb of Perth, 
Western Australia 





In Unit 6, we were interested in the number of observations in a sample 
that would exceed the population median. Consider each observation as a 
trial and equate success to ‘the observation exceeds the population 
median’. Then, if there are x successes and n observations, 


$$P(x) = {}^nC_x \, p^x q^{n-x}.$$

From the definition of the median, p = P(success) = 1/2 and q = 1 − p = 1/2. 
Thus the probability that there are x values above the population median 
in a sample of n observations is 


$${}^nC_x \left(\tfrac{1}{2}\right)^x \left(\tfrac{1}{2}\right)^{n-x} = {}^nC_x \left(\tfrac{1}{2}\right)^n.$$


This formula (the formula for a binomial probability when p = 1/2) is the 
special case of the binomial distribution used in Unit 6. 
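
The binomial formula and its p = 1/2 special case can be checked numerically. The following is a minimal Python sketch, not part of the module materials: the function names are hypothetical, and the values x = 2, x = 3 and n = 10 are chosen purely for illustration (n = 5 and p = 0.3 come from the lawyers example).

```python
from math import comb


def binomial_probability(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p (so q = 1 - p)."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)


def median_special_case(x, n):
    """Special case p = q = 1/2, as used for the sign test in Unit 6."""
    return comb(n, x) * 0.5**n


# Lawyers example: n = 5 trials with p = 0.3 (x = 2 is an illustrative choice).
print(round(binomial_probability(2, 5, 0.3), 4))

# With p = 1/2 the general formula and the special case agree.
print(binomial_probability(3, 10, 0.5), median_special_case(3, 10))
```

Summing `binomial_probability(x, n, p)` over x = 0, 1, ..., n should give 1, which is a quick check that the probabilities form a distribution.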


You have now covered the material related to Screencast 2 for 
Unit 12 (see the M140 website). 


You have also now covered the material needed for 
Subsection 12.1 of the Computer Book. 


Exercises on Section 3 





Exercise 4 Age and job category independent? 


Consider Table 4 (Subsection 3.1) and suppose an academic statistician is 
picked at random. 


(a) Are the events ‘the person is aged 40-49’ and ‘the person is a lecturer’ 
independent? 


(b) Use the formula P(A and B) = P(B) × P(A|B) to obtain the 
probability that the person is a lecturer aged 40-49. 


(c) What is the probability that the person is a lecturer or aged 40-49 or 
both? 


Exercise 5 Binomial probabilities 


Suppose the probability of success in a trial is 0.2 and that trials are 
independent of each other. 


(a) If there are four trials, what is the probability that there is exactly one 
success? 


(b) If there are five trials, what is the probability that there are exactly 
two successes? 


(c) If there are seven trials, what is the probability that there are no 
successes? 


Exercise 6 Relief from headache 


The probability that a person with a headache will feel better within an 
hour of taking a particular headache tablet is 0.6. If six people with 
headaches take the tablet, determine the following probabilities, stating 
any assumptions you make. 


(a) The probability that exactly two of them feel better within an hour. 
(b) The probability that two or fewer of them feel better within an hour. 


4 Hypothesis testing and contingency tables 


Hypothesis testing is an important use of statistics. In this section we 
review the structure of a hypothesis test, illustrating the main ideas by 
describing the χ² test for contingency tables. 


The following is the procedure in a hypothesis test. 


1. The first step is to state the null hypothesis, H₀, and the alternative 
hypothesis, H₁. We provisionally assume that H₀ is true and (usually) 
hope to find that the data cast strong doubt on this assumption. If 
that happens, then we will have evidence against the null hypothesis. 


2. From the data, we construct a test statistic whose value relates to the 
null hypothesis. Often the value of this statistic should be small if H₀ 
is true, so that a large value casts doubt on H₀. 


3. Lots of different test statistics may relate to H₀. We choose one whose 
probability distribution is fully known if H₀ is true. For example: in 
Unit 7, we used test statistics that followed a normal distribution if 
H₀ was true; in Unit 8, test statistics followed a χ² distribution if H₀ 
was true; in Unit 10, most of the test statistics followed a 
t distribution if H₀ was true; while in the sign test in Unit 6, the test 
statistic follows a binomial distribution with p = q = 1/2 if H₀ is true. 


4. The possible values that the test statistic might have taken are 
divided into two sets: 

• Set A contains those values that are at least as unlikely as the 
value given by the sample data, if H₀ holds. 

• Set B contains values that are relatively likely to occur if H₀ is 
true. Specifically, it contains those values that are more likely to 
occur, if H₀ is true, than the actual value given by the sample 
data. 


5. The actual value taken by the test statistic is in set A. We determine 
the probability of observing a value from that set. This probability is 
the p-value from the test. If it is small then either a very unusual 
event has happened, or H₀ is incorrect. 


6. The p-value is interpreted as follows: 


p > 0.10            Little evidence against H₀ 
0.10 > p > 0.05     Weak evidence against H₀ 
0.05 > p > 0.01     Moderate evidence against H₀ 
0.01 > p > 0.001    Strong evidence against H₀ 
0.001 > p           Very strong evidence against H₀ 


Unless we are using Minitab (or some other statistical software) we will 
seldom know the exact p-value. Instead the test statistic is compared 
with critical values given in an appropriate table (such as Table 28 in 
Subsection 4.4 of Unit 8 for χ² tests, or Table 2 in Subsection 3.3 of 
Unit 10 for t-tests; versions of both are in the Handbook). This 
determines the significance level at which H₀ can be rejected. 




7. The conclusion from the hypothesis test is summarised in plain 
English. For example, the conclusion might be: There is strong 
evidence that the new method is better than the old method. 
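
The interpretation scale for p-values in the procedure above can be written as a small lookup. This is an illustrative Python sketch only: the function name is hypothetical, and because the table does not say which band a p-value exactly on a boundary (such as 0.05) belongs to, the code assumes it falls into the stronger-evidence band.

```python
def evidence_against_h0(p_value):
    """Translate a p-value into the verbal scale used in this unit.

    Assumption: a p-value exactly on a boundary (e.g. 0.05) is placed
    in the stronger-evidence band; the table in the text leaves this open.
    """
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value > 0.10:
        return "little evidence against H0"
    if p_value > 0.05:
        return "weak evidence against H0"
    if p_value > 0.01:
        return "moderate evidence against H0"
    if p_value > 0.001:
        return "strong evidence against H0"
    return "very strong evidence against H0"


# Illustrative p-values, one for each band of the table.
for p in (0.2, 0.07, 0.03, 0.005, 0.0001):
    print(p, "->", evidence_against_h0(p))
```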


We next consider the hypothesis test of independence between the row and 
column variables of a contingency table, using the following example. 





Example 3 Coffee consumption 


A researcher wanted to examine whether a person’s coffee consumption 
was associated with their age. She selected a random sample of 180 people 
and classified each person according to whether their age was under 35, 
35-60 or at least 61, and whether their coffee consumption was low, 
medium or high. Results are given in Table 5. (The data are artificial.) 


Table 5 Level of coffee consumption by age group 


                   Coffee consumption 
               Low   Medium   High   Total 

Under 35        24       41     25      90 
35-60            4       32     24      60 
61 and over      8       17      5      30 
Total           36       90     54     180 





Coffee consumption by age group 


The researcher was interested in whether there is an association between 
age and the level of coffee consumption, or whether these variables are 
independent. When examining these questions using the data in a 
contingency table, the null hypothesis is that the row and column variables 
are independent. This assumption of independence enables the Expected 
value in each cell to be calculated. 


The alternative hypothesis is that row and column variables are not 
independent. Making the assumption ‘row and column variables are not 
independent’ would not enable us to calculate the Expected values in the 
cells, so that must not be chosen as the null hypothesis. 


In this example the hypotheses are: 


H₀: Age group and coffee consumption are independent. 


H₁: Age group and coffee consumption are not independent. 

For the hypothesis test, Expected values are needed. If the variables are 
independent and we treat the row totals and column totals as fixed, what 
values might you expect for the rest of Table 5? That is, suppose we knew 
the values in Table 6 (and nothing else), what would be reasonable 
estimates of other values in the table? 


Table 6 Row and column totals 


                   Coffee consumption 
               Low   Medium   High   Total 

Under 35                                90 
35-60                                   60 
61 and over                             30 
Total           36       90     54     180 


Well, half the people in the sample are aged under 35, so we would expect 
half of the people with a low level of coffee consumption to be under 35, 
half the people with medium coffee consumption to be under 35, and half 
the people with high coffee consumption to be under 35. Hence the 
Expected values for the top row of the table are 36/2 = 18, 90/2 = 45 and 
54/2 = 27. 


Similarly, one-third of the people in the sample are aged 35-60, so we 
would expect one-third of the people with a low level of coffee consumption 
to be in that age group (36/3 = 12), one-third of the people with medium 
coffee consumption to be in that age group (90/3 = 30), and one-third of 
the people with high coffee consumption to be in that age group 

(54/3 = 18). For the ‘61 and over’ group the Expected values are one-sixth 
of 36, 90 and 54 for the three levels of coffee consumption (i.e. 6, 15 and 9). 


These values form the Expected table: 

               Low   Medium   High   Total 
Under 35        18       45     27      90 
35-60           12       30     18      60 
61 and over      6       15      9      30 
Total           36       90     54     180 


The general formula for calculating the Expected values is usually written 
as: 

$$\text{Expected value} = \frac{\text{Row total} \times \text{Column total}}{\text{Overall total}}.$$

Thus, the Expected value for the first cell would usually be obtained from 

$$\frac{90 \times 36}{180} = 18$$

and similarly for the other cells. (Check that the Expected values are all 
greater than or equal to 5, so that the χ² test is valid.) 


Why might the data cast doubt on H₀? Well, the values in the Expected 
table, which were calculated under the assumption that H₀ is true, are not 
equal to the values observed in the sample, given in Table 5. The 
differences are the residuals. That is, 


Residual = Observed − Expected. 


The Residual for the first cell is 24 − 18 = 6 and the following is the 
complete Residual table. 


               Low   Medium   High 
Under 35         6       −4     −2 
35-60           −8        2      6 
61 and over      2        2     −4 


Now, even if row and column variables are independent (so that H₀ holds), 
the residuals would seldom all equal 0, because sample data are affected by 
random variation. The question is whether the residuals are so large that 
we should reject H₀. To answer this question, we must combine the 
information provided by the different residuals to form a test statistic. The 
distribution of this statistic must be fully known when H₀ is true. To this 
end, we calculate the χ² contribution of each cell and add these together. 
A cell’s χ² contribution is 

$$\frac{(\text{Residual})^2}{\text{Expected}}.$$

So the χ² contribution of the first cell is 6²/18 = 2. 


The complete table of χ² contributions is: 


                  Low   Medium     High 
Under 35       2        0.3556   0.1481 
35-60          5.3333   0.1333   2 
61 and over    0.6667   0.2667   1.7778 


The χ² test statistic is the sum of the nine χ² contributions: 


χ² = 2 + 0.3556 + 0.1481 + 5.3333 + 0.1333 + 2 + 0.6667 
       + 0.2667 + 1.7778 
   ≈ 12.682. 

If H₀ is true, then the test statistic follows a χ² distribution. For an 
Observed table with r rows and c columns, 

degrees of freedom = (r − 1) × (c − 1). 


Here the Observed table is a 3 × 3 table, so its degrees of freedom are 
(3 − 1) × (3 − 1) = 4. Hence we compare the test statistic (12.68) with a 
χ² distribution on 4 degrees of freedom. From Table 28 in Subsection 4.4 
of Unit 8 (and in the Handbook), the 5% and 1% critical values are 

CV5 = 9.488 and CV1 = 13.277. 


Since 12.682 > 9.488, we reject H₀ at the 5% significance level but, as 
12.682 < 13.277, we do not reject H₀ at the 1% significance level. 

We conclude that there is moderate evidence that the level of coffee 
consumption varies with age, but the evidence is not strong. 


Examination of the χ² contributions shows that the biggest value is for the 
low level of coffee consumption in the 35-60 age range. The residual for 
this cell is negative. That is, the number of 35- to 60-year-olds having a 
low level of coffee consumption is smaller than expected under H₀. This is 
the main sample evidence of departure from independence. 
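
The calculations in Example 3 can be reproduced with a short script. The sketch below is illustrative Python rather than the Minitab route used in the module; the function and variable names are hypothetical, while the data are those of Table 5 and the critical values are those quoted from Table 28.

```python
# Observed counts from Table 5 (rows: age groups; columns: Low, Medium, High).
observed = [
    [24, 41, 25],  # Under 35
    [4, 32, 24],   # 35-60
    [8, 17, 5],    # 61 and over
]


def chi_squared_independence(table):
    """Return Expected, Residual and contribution tables, the test
    statistic and the degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    overall = sum(row_totals)
    expected = [[r * c / overall for c in col_totals] for r in row_totals]
    residuals = [[o - e for o, e in zip(o_row, e_row)]
                 for o_row, e_row in zip(table, expected)]
    contributions = [[res**2 / e for res, e in zip(r_row, e_row)]
                     for r_row, e_row in zip(residuals, expected)]
    statistic = sum(sum(row) for row in contributions)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, residuals, contributions, statistic, df


expected, residuals, contributions, statistic, df = chi_squared_independence(observed)
print("Expected:", expected)
print("Test statistic:", round(statistic, 3), "on", df, "degrees of freedom")

# Critical values for 4 degrees of freedom, as quoted in the text.
CV5, CV1 = 9.488, 13.277
print("Reject H0 at 5%:", statistic > CV5)
print("Reject H0 at 1%:", statistic > CV1)
```

Before relying on the result, it is worth checking that every Expected value is at least 5, as the text requires for the test to be valid.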





Activity 19 Alternative test statistic? 


To decide whether the Residuals are so large that we should reject H₀, we 
formed the test statistic 

$$\sum \frac{(\text{Residual})^2}{\text{Expected}}.$$

Suggest a reason for not using the simpler quantity 

$$\sum (\text{Residual})^2$$


as the test statistic. 


Exercises on Section 4 





Exercise 7 Study of air pollution 


In a study of air pollution, a random sample of 100 households was 
selected in each of four localities. Each householder was asked if one or 
more members in the household were concerned by the level of air 
pollution. A summary of the responses is given in Table 7. 


Table 7 Households with a concern about air pollution 


             One or more members concerned 
Locality       Yes      No     Total 
A               25      75       100 
B               16      84       100 
C               12      88       100 
D               30      70       100 
Total           83     317       400 


A hypothesis test is required of whether concern about air pollution varies 
with locality. 


(a) Write down the null and alternative hypotheses. 
(b) Obtain the Expected table. 


(c) Hence obtain the Residual and χ² contributions tables. 

(d) Calculate the χ² test statistic and note the appropriate critical values, 
CV5 and CV1 (using Table 28 in Subsection 4.4 of Unit 8 and in the 
Handbook). 


(e) State your conclusions. 





5 z-tests, t-tests and confidence intervals 


Testing hypotheses about a population mean and forming a confidence 
interval for a population mean are common statistical tasks. So are testing 
hypotheses about the difference between the means of two populations and 
forming a confidence interval for that difference. These tasks are discussed 
in Subsection 5.2. Before that, in Subsection 5.1, underpinning results 
related to a normal distribution are reviewed. 


5.1 The normal distribution 


A normal distribution that has a mean μ = 5 and a standard deviation 
σ = 2 is shown in Figure 5(a). The distribution is bell-shaped and, as it is 
symmetric, the median and mode also equal 5. Figure 5(b) shows the 
standard normal distribution, which has mean μ = 0 and standard 
deviation σ = 1. 


[Cartoon contrasting a ‘normal distribution’ with a ‘paranormal distribution’; copyright Matthew Freeman] 





Figure 5 (a) A normal distribution with mean 5 and standard 
deviation 2; (b) a standard normal distribution. 

Comparison of the two figures illustrates that all normal distributions have 
identical shapes. Consequently, we can transform from any normal 
distribution to the standard normal distribution. 


Transforming a normal distribution to the standard normal 
distribution 


If a variable x has a normal distribution with mean μ and standard 
deviation σ, then the variable 

$$z = \frac{x - \mu}{\sigma}$$


has the standard normal distribution. 
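
As a quick illustration of this transformation, the following Python sketch standardises a value from the distribution in Figure 5(a), which has mean 5 and standard deviation 2. The function name and the value x = 8 are hypothetical choices made for the example.

```python
def standardise(x, mu, sigma):
    """Convert x to a z-value: the number of standard deviations
    that x lies above (positive) or below (negative) the mean."""
    if sigma <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mu) / sigma


# Figure 5(a) has mean 5 and standard deviation 2; x = 8 is illustrative.
z = standardise(8, mu=5, sigma=2)
direction = "above" if z > 0 else "below"
print(f"x = 8 is {abs(z)} standard deviations {direction} the mean of 5")
```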


Activity 20 Shell thicknesses 


The shell thickness, x, of eggs produced by a large flock of White Leghorn 
hens follows (approximately) a normal distribution with mean 
μ = 0.38 mm and standard deviation σ = 0.03 mm. 


Calculate the value of z corresponding to each of the following values of x 
(in mm). In each case, interpret your answer by completing a sentence of 
the form ‘So a shell thickness of *** mm is *** standard deviations *** 
than the mean thickness of *** mm.’ 


(a) x = 0.40; (b) x = 0.30. 


As noted in Activity 20, the thicknesses of eggshells of White Leghorn hens 
approximately follow a normal distribution. The heights of men in 
Scotland also approximately follow a normal distribution (as noted in 
Subsection 3.1 of Unit 7). So do the heights of 7-year-old boys, the 
cholesterol levels of young adults, the weight gains of calves on a standard 
diet and many other quantities. One reason that the normal distribution is 
important is that many natural quantities approximately follow a 
normal distribution. Indeed, the normal distribution is so ubiquitous that 
it is common to assume that a quantity follows a normal distribution unless 
there is reason to think otherwise. Thus, in the experiment where you grew 
mustard seedlings, you assumed that root length was normally distributed, 
both for seedlings grown in the light and those grown in the dark. 





White Leghorn hens 


Of course, you should not assume that a quantity follows a normal 
distribution if the data suggest otherwise, or if the situation suggests that 
the distribution will not be normal. So, for instance, it would be unwise to 
assume that incomes in the general population follow a normal 
distribution, because we know, from Unit 3 (Subsection 1.4), that 
distributions of incomes are usually right-skew, with a few people earning 
far more than the median income. 


You have also met another major reason for the importance of the normal 
distribution: the sampling distribution of a mean is approximately normal, 
regardless of whether or not the distribution of individual observations is 
normal. This result is part of the central limit theorem. The central limit 
theorem also specifies the mean and variance of the distribution of the 
mean. 


Approximate normality of the sampling distribution of the 
mean (central limit theorem) 


If n is large, no matter what shape the population distribution, the 
sampling distribution of the mean for samples of size n will be 
approximately normal. The mean will equal the population mean μ 
and the standard deviation will equal the standard error SE = σ/√n. 
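
The result in the box can be explored by simulation. The Python sketch below is an illustration under stated assumptions, not module material: it draws samples from a right-skew exponential population with mean 1 and standard deviation 1 (a hypothetical population chosen only because it is clearly not normal), and compares the mean and standard deviation of the simulated sample means with μ and σ/√n. The function names, the seed and the sample sizes are all arbitrary choices.

```python
import random
import statistics
from math import sqrt


def standard_error(sigma, n):
    """Standard error of the mean for samples of size n."""
    return sigma / sqrt(n)


def simulate_sample_means(n, num_samples, seed=0):
    """Means of num_samples random samples of size n from an
    exponential population with mean 1 and standard deviation 1."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]


n = 30
means = simulate_sample_means(n, num_samples=5000)
print("mean of sample means:", statistics.fmean(means))       # close to mu = 1
print("sd of sample means:  ", statistics.stdev(means))       # close to SE
print("sigma / sqrt(n):     ", standard_error(1.0, n))
```

A histogram of `means` would be expected to look roughly bell-shaped even though the population itself is strongly skewed.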


Figure 6 reproduces Figure 2 from Unit 7 (Section 2). It shows the 
proportions of students obtaining each examination mark in MS221 in one 
presentation. 


The distribution is clearly not bell-shaped, so it is far from normal. 

Figure 6 The distribution of MS221 exam marks 


Suppose now that we took two students at random from the MS221 cohort 
and determined the mean of their two marks. The value of the mean 
depends on which two students we pick — repeatedly picking two students 
at random and calculating their mean mark will yield lots of different 
values. The distribution of the mean mark (from a sample of two students) 
is shown in Figure 7. The distribution is not bell-shaped, but is much 
closer to being bell-shaped than the distribution in Figure 6. If we take 
samples of 20 students (rather than just two) then the mean of their marks 
is normally distributed, approximately. That distribution is shown in 
Figure 8. 

Figure 7 The distribution of the mean of 2 students’ marks 

Figure 8 The distribution of the mean of 20 students’ marks 


Activity 21 Distributions of sample means 

The distribution of the MS221 examination marks has mean μ = 66 and 
standard deviation σ = 22. 


(a) Give the mean and standard deviation of the sampling distribution of 
the mean for samples of size 5, and then for samples of size 15. 


(b) Would the sampling distributions in (a) be approximately normal? 
Which would be closer to a normal distribution? 





aircraft from 1928 

5.2 Inference about the means of populations 


This subsection contains new material, not previously covered in 
M140. 


There are a number of different hypothesis tests for examining whether a 
population mean takes a specified value, or whether two populations have 
means that are equal; they have close similarities. We first consider how to 
choose the appropriate hypothesis test. A unified description of the tests is 
then given that highlights their similarities. A new hypothesis test is also 
introduced. Associated with each hypothesis test is a method of forming a 
confidence interval and these methods are described later in the section. 


Which hypothesis test should be used? 


M140 contains a number of different hypothesis tests for answering the 
following questions: 


• Does the mean of a population take a specified value? 

• Do the means of two populations have the same value? 

For the first question, we have a single sample and would use one of the 
following tests: 

• Test 1. One-sample z-test 

• Test 2. One-sample t-test 

For the second question, we have two samples of data, one from each of 
two populations. We would use one of the following tests: 

• Test 3. Two-sample z-test 

• Test 4. Matched-pairs t-test 

• Test 5. Unpaired t-test for populations with a common variance 

• Test 6. Unpaired t-test for populations with unequal variances 

These tests make varied assumptions about observations following normal 
distributions and being random. Test 6 has not been covered in earlier 
units, but will be described later in this section. You will also learn in the 
Computer Book how to use Minitab to perform this test. 


The flow chart in Figure 9 gives the steps to follow in order to decide 
which test should be used when there is just one population. 


When there are two populations and we have a sample from each, one of 
the tests 3, 4, 5 or 6 would be used to test whether the population means 
are equal. The flow chart in Figure 10 gives the steps to follow in order to 
decide which of the tests to use (assuming one of these tests is suitable). 
The choice between test 5 or 6 depends upon whether ‘yes’ or ‘no’ is the 
response to the question ‘Population variances equal?’. Using the rule of 
thumb given in Unit 10 (Subsection 3.3), we treat the population variances 
as equal if the sample variances differ by a factor of less than three. 

[Flow chart: starting from ‘One population?’, the decision nodes ‘Variance known?’ and ‘Sample size 25 or greater?’ lead to either ‘One-sample z-test’ or ‘One-sample t-test’.] 


Figure 9 Flow chart for choosing a hypothesis test for inference about a 
population mean. (It is assumed that observations are random and that 
the population distribution is approximately normal if the sample size is 
small.) 


Activity 22 Choosing a one-sample test 


A sample is taken from a population in order to test the hypothesis that 
the population mean equals 15. What is the appropriate test for each of 
the following situations, assuming the population distribution is 
approximately normal? 


(a) The sample size is 30 and the population variance is known. 

(b) The sample size is 35 and the population variance is unknown. 
(c) The sample size is 15 and the population variance is unknown. 
(d) The sample size is 12 and the population variance is known. 

[Flow chart: starting from ‘Two populations?’, the decision nodes ‘Paired data?’, ‘Both variances known?’, ‘Both sample sizes at least 25?’ and ‘Population variances equal?’ lead to one of ‘Matched-pairs t-test’, ‘Two-sample z-test’, ‘Two-sample t-test with pooled sample variance’ or ‘Two-sample t-test with unequal variances’.] 






Figure 10 Flow chart for choosing a hypothesis test for inference about 
the difference between two population means. (Assumptions required for 
the selected test must also be satisfied.) 
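
The decisions in the two flow charts can be sketched as a pair of functions. This is illustrative Python only, and it rests on assumptions: the order of the questions is inferred from the node labels of Figures 9 and 10 and the surrounding text (the charts themselves are not reproduced here), the function and parameter names are hypothetical, and the inputs at the end are made-up values. The ‘factor of less than three’ rule of thumb is the one quoted from Unit 10.

```python
def choose_one_sample_test(n, variance_known):
    """Test for a single population mean (assumed reading of Figure 9)."""
    if variance_known or n >= 25:
        return "one-sample z-test"
    return "one-sample t-test"


def choose_two_sample_test(n_a, n_b, paired, variances_known,
                           var_a=None, var_b=None):
    """Test for comparing two means (assumed reading of Figure 10).

    var_a and var_b are the sample variances; they are only needed
    when the choice comes down to test 5 versus test 6.
    """
    if paired:
        return "matched-pairs t-test"
    if variances_known:
        return "two-sample z-test"
    if n_a >= 25 and n_b >= 25:
        return "two-sample z-test"
    if var_a is None or var_b is None or min(var_a, var_b) <= 0:
        raise ValueError("positive sample variances are needed here")
    # Rule of thumb: treat the population variances as equal if the
    # sample variances differ by a factor of less than three.
    if max(var_a, var_b) / min(var_a, var_b) < 3:
        return "two-sample t-test with pooled sample variance"
    return "two-sample t-test with unequal variances"


# Hypothetical inputs, not taken from any activity in this unit.
print(choose_one_sample_test(n=10, variance_known=False))
print(choose_two_sample_test(8, 9, paired=False, variances_known=False,
                             var_a=4.0, var_b=5.0))
```

Whichever test is selected, its own assumptions (random, independent observations and approximate normality for small samples) still have to be checked separately.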

Activity 23 Choosing a hypothesis test to compare two means 


Samples are taken from two populations in order to test the hypothesis 
that the population means are equal. What is the appropriate test for each 
of the following situations? (Assume population distributions are 
approximately normal, where necessary.) The sample sizes are n₁ and n₂, 
the population variances are σ₁² and σ₂², and the sample variances are s₁² 
and s₂². 

(a) n₁ = 30, n₂ = 50; σ₁² and σ₂² are unknown; the data are not matched 
pairs; s₁² = 12.1 and s₂² = 8.5. 

(b) n₁ = 30, n₂ = 14; σ₁² and σ₂² are unknown; the data are not matched 
pairs; s₁² = 8.4 and s₂² = 9.6. 

(c) n₁ = 12, n₂ = 12; σ₁² and σ₂² are unknown; the data are matched pairs; 
s₁² = 4.8 and s₂² = 7.3. 

(d) n₁ = 10, n₂ = 10; σ₁² and σ₂² are unknown; the data are not matched 
pairs; s₁² = 12.1 and s₂² = 3.5. 

(e) n₁ = 10, n₂ = 12; σ₁² and σ₂² are known; the data are not matched 
pairs; s₁² = 6.5 and s₂² = 5.9. 


You have now covered the material related to Screencast 3 for 
Unit 12 (see the M140 website). 


Performing the hypothesis tests 


After the appropriate test has been selected, the null and alternative 
hypotheses are specified. The (two-sided) hypotheses for all the tests are 
given in Table 8. For a one-sample test, the hypothesised value of the 
population mean (μ) is A. When there are two populations, $\mu_A$ and $\mu_B$ are 
the population means and $\mu_d = \mu_A - \mu_B$. 


Table 8 Null and alternative hypotheses for the (two-sided) z- and t-tests 


                          H₀                      H₁ 
One-sample tests          $\mu = A$               $\mu \neq A$ 
Matched-pairs tests       $\mu_d = 0$             $\mu_d \neq 0$ 
Other two-sample tests    $\mu_A - \mu_B = 0$     $\mu_A - \mu_B \neq 0$ 


After specifying H₀ and H₁, the test statistic must be calculated. Table 9 
gives this statistic for all the tests. 


• If there is only one sample, then the sample size, the sample mean and 
the standard deviation are n, $\bar{x}$ and s. 


• If there are two samples forming matched pairs, then the sample mean 
and the standard deviation of the differences within a pair are $\bar{d}$ and s 
and the number of pairs is n. 


• If there are two (unmatched) samples, then $n_A$ and $n_B$ denote the 
sample sizes, $\bar{x}_A$ and $\bar{x}_B$ are the sample means, and $s_A$ and $s_B$ are the 
sample standard deviations. 

When two population variances are assumed to be equal, the pooled 
estimate of their common standard deviation is 

$$s_p = \sqrt{\frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}}.$$


If the population standard deviation or population standard deviations are 
known, they will replace the corresponding sample values (this can only 
apply when the test statistic equals z). 


Table 9 Estimated standard error (ESE) and test statistic for the z- and 
t-tests 

Test                                           ESE                                  Test statistic 
1. One-sample z-test                           $s/\sqrt{n}$                         $z = (\bar{x} - A)/\text{ESE}$ 
2. One-sample t-test                           $s/\sqrt{n}$                         $t = (\bar{x} - A)/\text{ESE}$ 
3. Two-sample z-test                           $\sqrt{s_A^2/n_A + s_B^2/n_B}$       $z = (\bar{x}_A - \bar{x}_B)/\text{ESE}$ 
4. Matched-pairs t-test                        $s/\sqrt{n}$                         $t = \bar{d}/\text{ESE}$ 
5. Two-sample t-test with a common variance    $s_p\sqrt{1/n_A + 1/n_B}$            $t = (\bar{x}_A - \bar{x}_B)/\text{ESE}$ 
6. Two-sample t-test with unequal variances    $\sqrt{s_A^2/n_A + s_B^2/n_B}$       $t = (\bar{x}_A - \bar{x}_B)/\text{ESE}$ 


To examine the strength of evidence against Ho that the sample data 
provide, the test statistic is compared to critical values. 


For tests 1 and 3, the test statistic follows a standard normal 
distribution, assuming H₀ holds. So the critical values for the 
two-sided test are 1.96 and −1.96 at the 5% significance level, and 
2.58 and −2.58 at the 1% significance level. Hence, for example, the 
null hypothesis is rejected at the 5% significance level if the test 
statistic is greater than 1.96 or less than −1.96. 


For tests 2, 4 and 5, the test statistic follows a t distribution if H₀ 
holds. Critical values for the two-sided test at the 5% significance level 
are given in Table 2 in Subsection 3.3 of Unit 10 (and the Handbook). 
The number of degrees of freedom equals n − 1 for tests 2 and 4, and 
$n_A + n_B - 2$ for test 5. If the magnitude of the test statistic is greater 
than the critical value, then the null hypothesis is rejected at the 
5% significance level. 


For test 6, the test statistic approximately follows a t distribution if 
H₀ holds. However, its degrees of freedom is given by a relatively 
complicated expression, outside the scope of M140. 







A Brinell hardness testing 
machine 

The result of the hypothesis test should be stated clearly and 
conclusions drawn that reflect the setting from which the data came. 
It is also good practice to state any assumptions that have been made. 
These will involve the randomness and independence of observations 
and, for samples of modest size, the assumption that variation in a 
population is adequately modelled by a normal distribution. 
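
To see how the entries of Table 9 for the two unpaired t-tests translate into calculations, here is a minimal Python sketch. The function names are hypothetical and the summary statistics at the end are invented purely for illustration. The degrees of freedom for test 6 are deliberately not computed, since the text places that expression outside the scope of M140.

```python
from math import sqrt


def pooled_sd(n_a, s_a, n_b, s_b):
    """Pooled estimate of a common standard deviation (test 5)."""
    return sqrt(((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / (n_a + n_b - 2))


def t_common_variance(n_a, mean_a, s_a, n_b, mean_b, s_b):
    """Test 5: returns the t statistic and its degrees of freedom."""
    ese = pooled_sd(n_a, s_a, n_b, s_b) * sqrt(1 / n_a + 1 / n_b)
    return (mean_a - mean_b) / ese, n_a + n_b - 2


def t_unequal_variances(n_a, mean_a, s_a, n_b, mean_b, s_b):
    """Test 6: returns the t statistic only (degrees of freedom omitted)."""
    ese = sqrt(s_a**2 / n_a + s_b**2 / n_b)
    return (mean_a - mean_b) / ese


# Hypothetical summary statistics for two small unpaired samples.
t5, df5 = t_common_variance(10, 12.0, 3.0, 12, 10.0, 2.0)
t6 = t_unequal_variances(10, 12.0, 3.0, 12, 10.0, 2.0)
print(round(t5, 3), df5, round(t6, 3))
```

The two statistics are close here because the invented sample standard deviations are similar; they diverge more as the sample variances and sample sizes become more unequal.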


Test 6 is the ‘new’ hypothesis test that is appropriate when population 
variances appear to be unequal and the sample sizes are not both large. In 
activities in the Computer Book you will use Minitab to find p-values for 
this test. 





Example 4 Brinell hardness measurements 


In a Brinell hardness test, a hardened steel ball is pressed into the material 
being tested under a standard load. The diameter of the spherical 
indentation is then measured. A company is about to replace its current 
steel ball, Ball A, with a new steel ball, Ball B. Before doing so, it 
compares the measurements given by the two balls on eight pieces of 
material. Each piece of material was tested twice, once with each ball, 
giving the measurements in Table 10. 


The company wants to examine whether either ball gives, on average, 
higher measurements than the other. 


Table 10 Diameter measurements from Brinell hardness tests 

Sample     1    2    3    4    5    6    7    8 
Ball A    63   44   62   51   32   53   46   64 
Ball B    42   49   48   29   51   45   52   41 


As a pair of measurements was made on each sample, the data are paired. 
Thus a matched-pairs t-test is appropriate (see the flow chart in Figure 10). We 
put $\mu_d = \mu_A - \mu_B$, where $\mu_A$ and $\mu_B$ are the population means for the two 
balls. The hypotheses are: 


H₀: $\mu_d = 0$ and H₁: $\mu_d \neq 0$. 


To obtain the test statistic, the difference (d) between A’s reading and B’s 
reading is determined for each sample. 


Sample            1    2    3    4     5    6    7    8 
Difference (d)   21   −5   14   22   −19    8   −6   23 

Then n = 8, 

$$\sum d = 21 - 5 + 14 + 22 - 19 + 8 - 6 + 23 = 58$$

and 

$$\sum d^2 = 21^2 + (-5)^2 + 14^2 + 22^2 + (-19)^2 + 8^2 + (-6)^2 + 23^2 = 2136.$$

Hence $\bar{d} = 58/8 = 7.25$ and 

$$s^2 = \frac{1}{n-1}\left(\sum d^2 - \frac{\left(\sum d\right)^2}{n}\right) = \frac{1}{7}\left(2136 - \frac{58^2}{8}\right) \approx 245.07.$$

These summary statistics give $s \approx \sqrt{245.07} \approx 15.655$, 

$$\text{ESE} = \frac{s}{\sqrt{n}} \approx \frac{15.655}{\sqrt{8}} \approx 5.5349$$

and 

$$t = \frac{\bar{d}}{\text{ESE}} \approx \frac{7.25}{5.5349} \approx 1.310.$$


The critical value for a t-test is obtained from Table 2 in Subsection 3.3 of 
Unit 10 (and the Handbook). The number of degrees of freedom is 
n − 1 = 7, so the critical value is 2.365. The value of t is 1.310, which is 
less than 2.365. Thus H₀ is not rejected at the 5% significance level. 


The conclusion is that there is little evidence of one ball giving higher 
measurements, on average, than the other. This does not mean that the 
balls definitely do not differ systematically — a larger experiment might 
find evidence of a difference between them. 


The assumptions underlying the test are that observations are random, 
each pair of measurements is independent of all other pairs, and the 
differences (d) are approximately normally distributed. 
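
The arithmetic in Example 4 can be verified with a few lines of Python. This is an illustrative sketch, with hypothetical function and variable names; the data are those of Table 10 and the critical value 2.365 is the one quoted in the example.

```python
from math import sqrt
from statistics import mean, stdev

# Diameter measurements from Table 10.
ball_a = [63, 44, 62, 51, 32, 53, 46, 64]
ball_b = [42, 49, 48, 29, 51, 45, 52, 41]


def matched_pairs_t(first, second):
    """Return d-bar, s, ESE, t and the degrees of freedom for a
    matched-pairs t-test of 'mean difference = 0'."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    d_bar = mean(differences)
    s = stdev(differences)      # sample standard deviation (divisor n - 1)
    ese = s / sqrt(n)
    return d_bar, s, ese, d_bar / ese, n - 1


d_bar, s, ese, t, df = matched_pairs_t(ball_a, ball_b)
print(d_bar, round(s, 3), round(ese, 4), round(t, 3), df)

critical_value = 2.365  # 5% two-sided critical value for 7 degrees of freedom
print("Reject H0 at the 5% level:", abs(t) > critical_value)
```

Small differences in the last decimal place compared with the worked example can arise because the example rounds s before computing the ESE.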


Activity 24 Cholesterol reduction 


In a nutrition experiment, the effectiveness of two high-fibre diets at 
reducing serum cholesterol levels was examined. Fifty-seven men with high 
serum cholesterol were randomly allocated to receive an ‘oat’ diet or a 
‘bean’ diet for 21 days. Table 11 summarises the fall in serum cholesterol 
levels (before diet − after diet). Test whether there is a difference between 
the diets in their effects on cholesterol levels. (The data are artificial.) 


Table 11 Summary statistics for the fall in cholesterol (mg/dl) on two diets 

        Sample size   Sample mean   Sample standard deviation 
Oat              29          58.3                        19.2 
Bean             28          46.4                        16.5 

No good for this experiment: oats and beans together in these cookies 

You have now covered the material related to Screencast 4 for 
Unit 12 (see the M140 website). 


Constructing confidence intervals 


The task of forming confidence intervals for a population mean was first 
addressed in Unit 9. Given a set of data, there is a range of likely values 
that the population mean might equal. A confidence interval gives a 
precisely defined range through consideration of hypothesis tests. 


Let μ be the population mean and consider the hypotheses 

H₀: μ = A and H₁: μ ≠ A. 


Then H₀ will be rejected at the 5% significance level for some values of A 
but not for others. 


Confidence intervals 


A 95% confidence interval for μ includes all values of A for which we 
cannot reject H₀ at the 5% significance level. 


A 99% confidence interval for μ includes all values of A for which we 
cannot reject H₀ at the 1% significance level. 


Thus a confidence interval contains the plausible values that μ might take. 
If you took a very large number of samples from a population, then from 
each sample you could calculate a 95% confidence interval for μ. Most of 
these intervals would contain the true value of μ, but some would not. The 
definition of a confidence interval enables the following precise statements 
to be made. 


About 95% of the confidence intervals will contain the population
mean. The remaining 5% split roughly evenly: about 2.5% of all the
intervals will lie completely below the population mean and about
2.5% will lie completely above it.


So if you say that a 95% confidence interval includes the population mean, 
you will be right 95% of the time; you are 95% confident that your 
statement is correct. 


At the same time, after calculating a confidence interval, it is wrong to say
‘the probability is 0.95 that this confidence interval contains $\mu$.’ Once the
confidence interval has been calculated, either it contains the value of $\mu$ or
it does not. In the former case, the probability is 1 that the interval
contains $\mu$, while in the latter case, the probability is 0. (If we take lots of
samples from a population, the confidence interval will keep changing but
the population mean remains the same. Once the confidence interval has
been determined, there is nothing left that is random. Therefore, it is only
before gathering data that the probability is 0.95 that the future 95%
confidence interval will contain $\mu$.)
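

The long-run interpretation can be illustrated by simulation. The following
is a minimal sketch rather than part of the module materials. It assumes a
normal population whose standard deviation is known, so that the z-value 1.96
applies. The population mean of 50, standard deviation of 10, sample size of
25 and the 2000 repetitions are arbitrary illustrative choices, and the
function name is hypothetical.

```python
import math
import random


def coverage_counts(mu, sigma, n, repetitions, seed=1):
    """Count 95% z-intervals that contain mu, lie wholly below it, or wholly above it."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # z critical value x standard error
    contain = below = above = 0
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        lower, upper = mean - half_width, mean + half_width
        if upper < mu:
            below += 1      # the whole interval is below the population mean
        elif lower > mu:
            above += 1      # the whole interval is above the population mean
        else:
            contain += 1    # the interval contains the population mean
    return contain, below, above


contain, below, above = coverage_counts(mu=50, sigma=10, n=25, repetitions=2000)
print(contain / 2000, below / 2000, above / 2000)
```

The three proportions should come out close to 0.95, 0.025 and 0.025. Each
individual interval either contains 50 or does not; the 95% describes the
procedure over many samples, not any single interval.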




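To show how a confidence interval can be obtained from a hypothesis test,
consider the one-sample t-test of $H_0\colon \mu = A$ versus $H_1\colon \mu \neq A$.
From Table 9, the test statistic is

$$t = \frac{\bar{x} - A}{\text{ESE}}.$$

If $t_c$ is the critical value for the 5% significance level, then the hypothesis
that A is the population mean is not rejected at the 5% level if

$$\frac{\bar{x} - A}{\text{ESE}} < t_c \quad\text{and}\quad -t_c < \frac{\bar{x} - A}{\text{ESE}}.$$

Thus, it is not rejected at the 5% level if

$$\bar{x} - A < t_c \times \text{ESE} \quad\text{and}\quad -t_c \times \text{ESE} < \bar{x} - A,$$

which is equivalent to

$$\bar{x} - t_c \times \text{ESE} < A \quad\text{and}\quad A < \bar{x} + t_c \times \text{ESE}.$$

Thus A is not rejected at the 5% significance level if it is in the interval

$$(\bar{x} - t_c \times \text{ESE},\ \bar{x} + t_c \times \text{ESE}).$$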


By definition, this interval is the 95% confidence interval for the 
population mean. It is an example of the following more general result. 


Confidence interval for a mean or the difference between two 
means 





The lower limit of the confidence interval is:

    point estimate − (z or t critical value) × ESE

and the upper limit is:

    point estimate + (z or t critical value) × ESE,


where ESE is the estimated standard error of the point estimate. 


To apply this result requires a point estimate, a z or t critical value, and
an ESE. The point estimate will be the sample mean when making
inferences about one population mean, and it will be the difference
between the two sample means when making inferences about the
difference between two population means. Table 9 gives the ESE. Actually,
it gives lots of ESEs. To decide which one is appropriate:


• If the interval is for a population mean, consider what hypothesis test
  you would use to test $H_0\colon \mu = A$. The ESE for that hypothesis test is
  the one that should be used to form the confidence interval.


• If the interval is for the difference between two population means,
  consider what hypothesis test you would use to test $H_0\colon \mu_A = \mu_B$.
  The ESE for that hypothesis test is the one that should be used to
  form the confidence interval.


The choice of hypothesis test also determines the z or t critical value that
is used to form the confidence interval. If a z-test is the appropriate
hypothesis test, then the z-value of 1.96 should be used for a 95%
confidence interval and 2.58 for a 99% confidence interval. If a t-test is the
appropriate hypothesis test, then $t_c$ should be obtained from Table 2 in
Subsection 3.3 of Unit 10 (and the Handbook); the degrees of freedom (as
with the hypothesis tests) are $n - 1$ for the one-sample t-test and paired
t-test, and $n_A + n_B - 2$ for the two-sample t-test with pooled sample
variance.





Example 5 Confidence interval from Brinell hardness measurements 


Example 4 concerned measurements from Brinell hardness tests using two
hardened steel balls, Ball A and Ball B. A paired t-test was used to test
whether the difference between the population means ($\mu_A$ and $\mu_B$) was 0.


Suppose, now, that a 95% confidence interval for $\mu_A - \mu_B$ is required.
Using the data in Example 4, the mean for Ball A is

$$\frac{63 + 44 + 62 + 51 + 32 + 53 + 46 + 64}{8} = 51.875$$

and the mean for Ball B is

$$\frac{42 + 49 + 48 + 29 + 51 + 45 + 52 + 41}{8} = 44.625.$$

Hence the point estimate is 51.875 − 44.625 = 7.25. (This equals $\bar{d}$, of
course, which was calculated in Example 4. The separate means for Ball A
and Ball B did not have to be calculated and $\bar{d}$ could have been taken as
the point estimate.)


From Example 4, ESE ≈ 5.5349. The degrees of freedom are $n - 1 = 7$
and, for 7 degrees of freedom, 2.365 is the critical value, $t_c$. The value of
2.365 could be obtained from Table 2 in Subsection 3.3 of Unit 10 (and the
Handbook), but it has already been obtained in Example 4.


Hence the lower limit of the 95% confidence interval is

    7.25 − 2.365 × 5.5349 ≈ −5.8

and the upper limit is

    7.25 + 2.365 × 5.5349 ≈ 20.3,

so the 95% confidence interval for $\mu_A - \mu_B$ is (−5.8, 20.3).
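

The arithmetic in this example can be checked with a short script. This is
a sketch under two assumptions: the measurements are paired in the order in
which they are listed above, and the critical value 2.365 is typed in from
the text rather than computed. The function name is hypothetical.

```python
import math

# Brinell hardness measurements as listed in Example 5.
# Assumption: the k-th value for Ball A is paired with the k-th value for Ball B.
ball_a = [63, 44, 62, 51, 32, 53, 46, 64]
ball_b = [42, 49, 48, 29, 51, 45, 52, 41]


def paired_confidence_interval(first, second, critical_value):
    """Return (point estimate, ESE, lower limit, upper limit) for the mean difference."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    d_bar = sum(differences) / n
    variance = sum((d - d_bar) ** 2 for d in differences) / (n - 1)
    ese = math.sqrt(variance / n)  # estimated standard error of the mean difference
    return d_bar, ese, d_bar - critical_value * ese, d_bar + critical_value * ese


T_CRITICAL_7_DF = 2.365  # 5% critical value for 7 degrees of freedom, from the text
d_bar, ese, lower, upper = paired_confidence_interval(ball_a, ball_b, T_CRITICAL_7_DF)
contains_zero = lower < 0 < upper
print(d_bar, round(ese, 3), round(lower, 1), round(upper, 1), contains_zero)
```

The limits agree with (−5.8, 20.3) to one decimal place, and the flag
`contains_zero` records whether 0 lies inside the interval.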


In the last example, notice that the 95% confidence interval contains the 
value 0 — so the hypothesis that the population means are equal would not 
be rejected at the 5% significance level. This was also the conclusion from 
the hypothesis test in Example 4, and it illustrates the following close 
connection between confidence intervals and hypothesis tests. 




• If the 95% confidence interval does not include the value given by
  the null hypothesis, we reject the null hypothesis at the 5%
  significance level.


• If the 99% confidence interval does not include the value given by
  the null hypothesis, we reject the null hypothesis at the 1%
  significance level.


Activity 25 Confidence intervals for cholesterol reduction 


Data from a nutrition experiment were summarised in Activity 24. The
data related to the reduction in serum cholesterol level on two diets, an
‘oat’ diet (diet A) and a ‘bean’ diet (diet B). Let $\mu_A$ and $\mu_B$ denote the
population mean reductions on the two diets.


(a) Construct a 95% confidence interval for $\mu_A - \mu_B$.

(b) Construct a 99% confidence interval for $\mu_A - \mu_B$.

(c) Which interval is shorter? Would that always be the shorter interval?

(d) Which of the confidence intervals contain the value 0? How does that
    relate to the results of the hypothesis test in Activity 24?


You have now covered the material needed for Subsection 12.2 of
the Computer Book.


Exercises on Section 5 





Exercise 8 Which one-sample test? 


A sample is taken from a population in order to test the hypothesis that 
the population mean equals 50. What is the appropriate test for each of 
the following situations, assuming the population distribution is 
approximately normal? 


(a) The sample size is 20 and the population variance is known.

(b) The sample size is 12 and the population variance is unknown.

(c) The sample size is 40 and the population variance is unknown.

(d) The sample size is 35 and the population variance is known.










Exercise 9 Which test for comparing two means? 


Samples are taken from two populations in order to test the hypothesis
that the population means are equal. What is the appropriate test for each
of the following situations? (Assume population distributions are
approximately normal, where necessary.) The sample sizes are $n_1$ and $n_2$,
the population variances are $\sigma_1^2$ and $\sigma_2^2$, and the sample variances
are $s_1^2$ and $s_2^2$.


(a) $n_1 = 8$, $n_2 = 18$; $\sigma_1^2$ and $\sigma_2^2$ are unknown; the data are not
    matched pairs; $s_1^2 = 23.8$ and $s_2^2 = 28.2$.

(b) $n_1 = 20$, $n_2 = 20$; $\sigma_1^2$ and $\sigma_2^2$ are unknown; the data are
    matched pairs; $s_1^2 = 1.3$ and $s_2^2 = 1.7$.

(c) $n_1 = 50$, $n_2 = 9$; $\sigma_1^2$ and $\sigma_2^2$ are unknown; the data are not
    matched pairs; $s_1^2 = 1.7$ and $s_2^2 = 9.1$.

(d) $n_1 = 30$, $n_2 = 30$; $\sigma_1^2$ and $\sigma_2^2$ are unknown; the data are not
    matched pairs; $s_1^2 = 11.3$ and $s_2^2 = 15.5$.

(e) $n_1 = 10$, $n_2 = 30$; $\sigma_1^2$ and $\sigma_2^2$ are unknown; the data are not
    matched pairs; $s_1^2 = 17.4$ and $s_2^2 = 10.6$.





6 Correlation and regression 


Often, the reason for gathering data is to learn about the relationship 
between different variables. When only two variables are involved, much 
can be learned about their relationship by drawing a scatterplot of one 
variable against the other. In Subsection 6.1, we consider the information 
that a scatterplot can provide. In Subsection 6.2, correlation and 
regression are reviewed. 


6.1 Scatterplots and relationships 


Data are said to be linked when two or more variables are recorded for the 
same sampling units. Here, the focus is on the case where there are two 
variables, so that the linked data are paired data. 


Scatterplots are a useful tool for examining the relationship between a pair
of variables. They can address the following questions.


Is there a relationship between the two variables? If so:
• Is the relationship positive or negative, or is it neither?
• Is the relationship strong or weak?
• Is the relationship linear?




Professor Hans Rosling (b. 1948) 


Hans Rosling is a Swedish public health doctor, academic and 
statistician. He began his career in public health, spending some time 
in remote rural parts of Africa. Since 1997, he has been Professor of 
International Health at the Karolinska Institute, a world-renowned 
medical university in Stockholm. 


More recently, Rosling has become famous for persuasively presenting 
data and ideas on health, international development and many other 
things. With his son and daughter-in-law he set up the Gapminder 
Foundation in 2005, to develop animated software for showing data 
and to use it to show global development trends. 





Professor Hans Rosling does great things with scatterplots! 


Figure 11 plots, for ten towns, the percentage of households in 
owner-occupied housing against the percentage of employed residents 
working in manufacturing. 


Figure 11 Percentage of employed residents working in manufacturing
and percentage of households in owner-occupied houses


There appears to be no relationship between the variables: if in an 
eleventh town you knew the percentage of employed residents working in 
manufacturing, it would not give any indication of the likely percentage of 
households in that town who live in owner-occupied housing. Similarly, 
knowing the percentage of households in owner-occupied housing would 
give no indication of the percentage of employed residents who work in 
manufacturing. 


There is said to be no relationship between two variables when 
knowledge of one of them provides no information about the value of 
the other. 


Figure 12 plots two variables that are related. The shaded area contains 
the points and it slopes upwards: as the percentage of unemployed men in 
a town increases, the percentage of households with no car also increases. 
Thus, if in an eleventh town the percentage of men unemployed were 
known, it would influence an estimate of the likely percentage of 
households in the town that had no car. 


Figure 12 Percentage of males unemployed and percentage of households
with no car in ten towns


The variables in Figure 12 are said to be positively related because an 
increase in one variable is associated with an increase in the other variable. 
Variables are said to be negatively related if an increase in one variable is 
associated with a decrease in the other variable. An example of negatively 
related variables is given in Figure 13. 


Figure 13 Data from an ultrasonic calibration study (axis label: metal
distance)


Variables need not be positively or negatively related, even when there is 
clearly a relationship between them. This is illustrated in Figure 14, where 
y is large for values of x near 10 and values of x near 20, and smaller for 
values of x near 15. 


Figure 14 A scatterplot of some data


A relationship is said to be strong when all the points on a scatterplot 
lie close to a line. 


A relationship is said to be weak when all the points only loosely 
follow a line. 


The relationships in Figures 12, 13 and 14 are all strong. Figure 15 is an 
example of a weak (positive) relationship. 


Figure 15 Average water consumption in metered and unmetered
households


An important characteristic of a relationship is whether it is linear or
non-linear. A relationship is said to be linear if it can be summarised
reasonably well by a straight line. It is said to be non-linear if it can be
summarised reasonably well by a curve but not by a straight line.
Consequently, the relationships in Figures 13 and 14 are non-linear,
while Figure 12 shows a linear relationship. This is not typical, though:
linear relationships commonly occur in practice.





Activity 26 Identifying characteristics of relationships 


For each of the following six scatterplots, say whether there is a 
relationship between the variables. If there is one, classify it as (i) positive, 
negative or neither, (ii) weak or strong, and (iii) linear or non-linear. 


Figure 16 Six scatterplots showing different relationships


6.2 Linear relationships


Correlation measures the strength of a linear relationship. 


Regression describes a linear relationship. 


Correlation 


The correlation coefficient has the following formula:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \times \sum (y - \bar{y})^2}}.$$

The value of the correlation coefficient is always between −1 and +1. A
value near +1 means there is a strong positive relationship between x and
y and a value near −1 indicates a strong negative relationship between
them. For example, the variables in Figure 12 (Subsection 6.1) show quite
a strong positive relationship. Their correlation coefficient (calculated
using the above formula and the data that gave the figure) is 0.91, which is
reasonably close to 1. In contrast, the variables in Figure 15 have a
relationship that is weak (and positive) with the correlation coefficient
lower at 0.57. When there is little relationship between two variables,
the correlation coefficient takes a value near 0. Thus in Figure 11, the two
variables appear unrelated and the correlation coefficient is −0.01.




The correlation coefficient only reflects the degree to which a relationship
is linear. In Figure 14, there is a very strong relationship but the
relationship is non-linear, and the correlation coefficient is 0.000 (correct
to three decimal places). However, with many non-linear relationships, the
relationship can partly be modelled by a straight line. For example, in
Figure 13, the relationship is clearly non-linear, but a straight line with a
negative slope would partly capture the negative relationship between the
two variables. The correlation coefficient is −0.85, which is far from 0.
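

Both behaviours can be reproduced with made-up numbers. The sketch below
does not use the data behind Figures 13 and 14, which are not given here; the
x-values 10 to 20 and the two curves are hypothetical, chosen only to mimic a
U-shaped relationship and a steadily falling curve.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = list(range(10, 21))                # hypothetical x-values 10, 11, ..., 20
u_shape = [(x - 15) ** 2 for x in xs]   # large near 10 and 20, small near 15
falling_curve = [1 / x for x in xs]     # decreasing, but not a straight line

r_u = correlation(xs, u_shape)
r_falling = correlation(xs, falling_curve)
print(r_u, r_falling)
```

For the U-shaped values the positive and negative contributions to the
numerator cancel, so the coefficient is 0 even though y is completely
determined by x. For the falling curve the coefficient is strongly negative
without the relationship being linear.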


Activity 27 Ordering correlation coefficients. 


Order the six scatterplots in Figure 16 (Subsection 6.1) according to the 
correlation between the variables in the plots. The scatterplot 
corresponding to the highest correlation coefficient should be first in the 
list, the one corresponding to the second highest correlation coefficient 
should be second in the list, and so on. Thus the one corresponding to the 
most negative correlation coefficient should be at the end of the list. 


The following is the procedure for calculating the correlation coefficient. 




Calculating the correlation coefficient 


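Given a batch of n linked data pairs, (x, y), the correlation coefficient
(r) is obtained as follows:


1. Calculate $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$ and $\sum xy$.

2. Calculate

$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n},$$

$$\sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n},$$

$$\sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}.$$

3. Use the values from step 2 to calculate

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \times \sum (y - \bar{y})^2}}.$$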








Example 6 Basal area and weight of trees 


During crop-thinning of a small forest of Sitka spruce, seven trees were 
felled and the cross-sectional area of each tree at its base (its basal area), 
x, and total dry weight, y, were measured. The following results were 
obtained. 


Table 12 Basal area and dry weight of seven trees

x (m² × 100)   2.24   1.06   0.79   1.78   1.22   0.54   1.40
y (kg)         79.1   33.9   19.2   54.8   51.0   12.0   46.7


To determine the correlation coefficient between x and y we first calculate:

$$\sum x = 2.24 + 1.06 + \cdots + 1.40 = 9.03$$

$$\sum y = 79.1 + 33.9 + \cdots + 46.7 = 296.7$$

$$\sum x^2 = 2.24^2 + 1.06^2 + \cdots + 1.40^2 = 13.6737$$

$$\sum y^2 = 79.1^2 + 33.9^2 + \cdots + 46.7^2 = 15\,703.59$$

$$\sum xy = 2.24 \times 79.1 + 1.06 \times 33.9 + \cdots + 1.40 \times 46.7 = 459.91.$$

The sample size is n = 7, so

$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} = 13.6737 - \frac{9.03^2}{7} = 2.025,$$

$$\sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} = 15\,703.59 - \frac{296.7^2}{7} \approx 3127.75,$$

$$\sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n} = 459.91 - \frac{9.03 \times 296.7}{7} = 77.167.$$

Hence in this example,

$$\text{correlation} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \times \sum (y - \bar{y})^2}} = \frac{77.167}{\sqrt{2.025 \times 3127.75}} \approx 0.97.$$

This correlation is close to 1, indicating a strong linear relationship
between basal area and dry weight.
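

The three-step procedure can be sketched in a few lines of code, applied
here to the data in Table 12. The function and variable names are
illustrative choices, not notation from the module.

```python
import math

basal_area = [2.24, 1.06, 0.79, 1.78, 1.22, 0.54, 1.40]   # x, from Table 12
dry_weight = [79.1, 33.9, 19.2, 54.8, 51.0, 12.0, 46.7]   # y, from Table 12


def corrected_sums(xs, ys):
    """Steps 1 and 2: raw sums, then sums of squares and products about the means."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxx, syy, sxy


sxx, syy, sxy = corrected_sums(basal_area, dry_weight)
r = sxy / math.sqrt(sxx * syy)   # step 3
print(round(sxx, 3), round(syy, 2), round(sxy, 3), round(r, 2))
```

The printed values can be compared with 2.025, 3127.75, 77.167 and 0.97 in
the worked calculation.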





Regression 


Data on Sitka spruce trees were given in Example 6. The dataset could be 
used to obtain an equation for estimating a tree’s dry weight from the 
cross-sectional area at its base. The equation is potentially useful since, 
while a tree is growing, its basal area is much easier to measure than its 
weight. A suitable equation can be obtained through least squares linear 
regression, provided the relationship between the variables is linear. 


In considering the correlation between two variables, it does not matter 
which variable is called x and which is called y — swapping them around 
would not change the value of the correlation coefficient. That is, in 
correlation the two variables have identical roles. In regression though, 
there is a response variable and an explanatory variable, and these have 
different roles. 


A response variable (usually denoted as y) is the variable that is 
being explained or whose value depends on the other variable. It is 
also the variable to be predicted if predictions are to be made. 


An explanatory variable (usually x) is the variable that is doing the 
explaining or is the variable on which the response variable depends. 


Suppose now that the dry weight of a Sitka spruce is to be estimated from
its basal area. Then the dry weight is y and the basal area is x. In
Example 7, we will obtain the least squares regression line relating basal
area to dry weight. The line is

    y = −6.77 + 38.1x.

Given a tree’s basal area, this equation gives its fitted value:

    dry weight fitted value = −6.77 + 38.1 × basal area.




For example, the first tree in the dataset has a basal area of 2.24, so its 
fitted value is 


    −6.77 + 38.1 × 2.24 ≈ 78.6.


If we did not know the tree’s actual dry weight, then it would be estimated 
as 78.6. 


The following table extends Table 12, from Example 6, to include the 
fitted values for the seven trees in the dataset. 


Table 13 Basal areas, actual dry weights and fitted values of seven Sitka
spruce

Basal area (x)          2.24   1.06   0.79   1.78   1.22   0.54   1.40
Actual dry weight (y)   79.1   33.9   19.2   54.8   51.0   12.0   46.7
Fitted value            78.6   33.6   23.3   61.0   39.7   13.8   46.6


Figure 17 is a scatterplot of these data and also shows the least squares
regression line. The actual values of the data are marked by black dots. It
can be seen that the regression line virtually passes through the third, fifth
and seventh data points (counting from the left of the plot), and is close to
the others. Hence the line fits the data well. (In Example 6, the correlation
coefficient of 0.97 indicated a strong linear relationship between basal area
and dry weight, so it could be anticipated that the line would fit the data
well.)


Figure 17 Plot of dry weight against basal area, with least squares
regression line


Residuals are defined as the differences between the data values and
the fitted values:


    Residual = Data − Fit.


In Figure 17, the fitted values of the seven trees are shown as orange dots. 
From Table 13, their positions are (2.24, 78.6), (1.06, 33.6),..., (1.40, 46.6). 
As the actual data values (the black dots) are at (2.24, 79.1), 

(1.06, 33.9),..., (1.40, 46.7), the data points lie vertically above the fitted 
values, or vertically below them. Thus, residuals are the vertical distances 
from the data points to the line. 


A regression line can serve various purposes, but to understand why it is
constructed in the way that it is, think of the line as a way of predicting
the response variable, y, when we know the value of the explanatory
variable, x. Each residual is the inaccuracy in predicting a y-value in the
dataset, given its corresponding x-value. Thus, the residuals indicate the
usefulness of the line for making predictions. The aim in least squares
regression is to minimise the sum of the squares of these residuals. Thus,
in deciding how close a line is to the data points, the perpendicular
distances from the data points to the line are not the quantities of interest:
the focus is on the vertical distances from each data point to the line.
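

The minimising property can be checked numerically. The sketch below takes
the coefficients −6.772 and 38.107 of the line for the Sitka spruce data and
compares its sum of squared residuals with that of a few nearby lines. The
alternative intercepts and slopes are arbitrary illustrative choices.

```python
basal_area = [2.24, 1.06, 0.79, 1.78, 1.22, 0.54, 1.40]   # x, from Table 12
dry_weight = [79.1, 33.9, 19.2, 54.8, 51.0, 12.0, 46.7]   # y, from Table 12


def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum of squared vertical distances (Data - Fit) from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


least_squares = sum_squared_residuals(-6.772, 38.107, basal_area, dry_weight)

# Hypothetical alternative lines: same intercept with a different slope,
# or same slope with a different intercept.
alternatives = [(-6.772, 40.0), (-6.772, 36.0), (-4.0, 38.107), (-9.0, 38.107)]
alternative_sums = [
    sum_squared_residuals(a, b, basal_area, dry_weight) for a, b in alternatives
]
print(round(least_squares, 2), [round(s, 2) for s in alternative_sums])
```

Each alternative line gives a larger sum of squared residuals than the
least squares line.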
The least squares regression line is calculated as follows.


Calculation of the least squares regression line y = a + bx for
a set of n data points (x, y)


1. Calculate $\sum x$, $\sum y$, $\sum (x - \bar{x})^2$ and $\sum (x - \bar{x})(y - \bar{y})$.

2. Calculate the means of x and y:

$$\bar{x} = \frac{\sum x}{n} \quad\text{and}\quad \bar{y} = \frac{\sum y}{n}.$$

3. The slope b is given by

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}.$$

4. The intercept a is given by

$$a = \bar{y} - b\bar{x}.$$


For step 1 above, $\sum (x - \bar{x})^2$ and $\sum (x - \bar{x})(y - \bar{y})$ can be
obtained from

$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$

and

$$\sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}.$$

However, if you have calculated the correlation coefficient, you will already
have obtained these quantities as part of the procedure for its computation.
(This reflects the close connection between correlation and regression.)





Example 7 Regression line relating a tree's basal area and weight 


To determine the least squares regression line relating a Sitka spruce’s 
basal area to its dry weight, the following quantities obtained in Example 6 
will be used. 


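$$\sum x = 9.03, \quad \sum y = 296.7, \quad n = 7,$$

$$\sum (x - \bar{x})^2 = 2.025, \quad \sum (x - \bar{x})(y - \bar{y}) = 77.167.$$

These give

$$\bar{x} = \frac{\sum x}{n} = \frac{9.03}{7} = 1.29 \quad\text{and}\quad \bar{y} = \frac{\sum y}{n} = \frac{296.7}{7} \approx 42.386.$$

Hence, the slope is

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{77.167}{2.025} \approx 38.107,$$

and the intercept is

$$a = \bar{y} - b\bar{x} = 42.386 - 38.107 \times 1.29 \approx 42.386 - 49.158 = -6.772.$$

These form the regression line:

    y = −6.77 + 38.1x.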
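

The four-step calculation can also be sketched in code, using the data in
Table 12. The function name is an illustrative choice. Because the code
keeps the unrounded slope and intercept, a fitted value may differ in the
last digit from one worked out by hand with the rounded coefficients −6.77
and 38.1.

```python
basal_area = [2.24, 1.06, 0.79, 1.78, 1.22, 0.54, 1.40]   # x, from Table 12
dry_weight = [79.1, 33.9, 19.2, 54.8, 51.0, 12.0, 46.7]   # y, from Table 12


def least_squares_line(xs, ys):
    """Return (intercept a, slope b) of the least squares line y = a + bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


a, b = least_squares_line(basal_area, dry_weight)
fitted = [a + b * x for x in basal_area]
residuals = [y - fit for y, fit in zip(dry_weight, fitted)]   # Residual = Data - Fit
print(round(a, 2), round(b, 1))
print([round(res, 1) for res in residuals])
```

A useful check on any least squares fit with an intercept is that the
residuals sum to zero, apart from rounding error.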





A regression line can be used to form estimates and make predictions. 
Standard terminology is to estimate a mean response and predict an 
individual response. The last example will be used to discuss the two cases. 


Given a particular basal area, we might want to: 


1. Estimate the mean dry weight of Sitka spruce that have that basal 
area. 


2. Predict the dry weight of a single Sitka spruce tree that has that basal 
area. 


In both cases, the estimate/prediction will be the point on the regression
line that corresponds to the specified basal area, i.e. −6.77 + 38.1x, where
x is the specified basal area. However, accuracy will not be the same in the
two cases.


To elaborate, if the position of the regression line were known precisely, 
then the mean dry weight of Sitka spruce trees that have a particular basal 
area could be estimated with perfect accuracy. But there would still be 
uncertainty in predicting the dry weight of a single Sitka spruce tree from 
its basal area, because individual observations do not lie on the regression 
line — they vary around it. 


More generally, if we took another sample of seven Sitka spruce, the
regression line would not be identical to the one we have calculated. That 
is, the position of the regression line is affected by random variation. This 
random variation affects both the estimate of a mean response and the 
prediction of a single response. It is the only source of uncertainty that 
affects the estimated mean response, whereas individual variation also 
affects the accuracy with which a single response is predicted. 
Consequently, for any given x, there is greater uncertainty in predicting an 
individual response than in estimating a mean response. 


Given a particular x, a confidence interval for the mean response can be 
determined using methods similar to those described in Subsection 5.2 
(though the ESE is more complicated). Different values of x can be 
considered and the corresponding confidence interval determined for each. 
As x is varied, the confidence interval changes smoothly, getting narrower 
towards the mean of the x-values. This is illustrated in Figure 18, where 
long-dashed lines map end-points of the 95% confidence intervals for the 
mean dry weight as the basal area (x) changes. 


An interval estimate for a prediction is referred to as a prediction interval. 
The dotted lines in Figure 18 map end-points of the 95% prediction 
intervals for an individual response. It can be seen that the prediction 
intervals are much wider than the confidence intervals — much of the 
uncertainty in predicting an individual value stems from the random 
variation of individual values about the regression line. 
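

The two kinds of interval can be compared numerically. The sketch below is
an illustration only: it assumes the standard formulas for simple linear
regression, which are not stated in this unit, and it is not claimed to be
how Figure 18 was produced. With s the residual standard deviation on n − 2
degrees of freedom, the ESE for the mean response at $x_0$ is taken to be
$s\sqrt{1/n + (x_0 - \bar{x})^2 / \sum (x - \bar{x})^2}$, and for an individual
response it is $s\sqrt{1 + 1/n + (x_0 - \bar{x})^2 / \sum (x - \bar{x})^2}$. The
critical value 2.571 is the 5% two-sided t-value for 5 degrees of freedom.

```python
import math

basal_area = [2.24, 1.06, 0.79, 1.78, 1.22, 0.54, 1.40]   # x, from Table 12
dry_weight = [79.1, 33.9, 19.2, 54.8, 51.0, 12.0, 46.7]   # y, from Table 12
T_CRITICAL_5_DF = 2.571   # assumed t critical value for n - 2 = 5 degrees of freedom


def interval_half_widths(x0, xs, ys, t_critical):
    """Half-widths of the 95% intervals at x0: (mean response, individual response)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))                  # residual standard deviation
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx      # uncertainty in the line's position
    mean_half_width = t_critical * s * math.sqrt(spread)
    prediction_half_width = t_critical * s * math.sqrt(1 + spread)
    return mean_half_width, prediction_half_width


half_widths = {
    x0: interval_half_widths(x0, basal_area, dry_weight, T_CRITICAL_5_DF)
    for x0 in (0.54, 1.29, 2.24)   # smallest x, mean of x, largest x
}
for x0, (ci, pi) in half_widths.items():
    print(x0, round(ci, 1), round(pi, 1))
```

Under these assumptions the confidence interval is narrowest at the mean
basal area of 1.29, and the prediction interval is wider than the
confidence interval at every x, because of the extra 1 under the square root
that represents individual variation about the line.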


Figure 18 95% confidence intervals and 95% prediction intervals for dry
weight (kg) as basal area (m² × 100) varies


Figure 18 shows that the relationship between basal area and dry weight is 
approximately linear when the basal area is between 0.5 and 2.3. The data 
do not tell us whether the linear relationship extends outside that range. 
Hence, the regression line should not be used to make predictions (or 
estimate the mean response) for basal areas below 0.5 or above 2.3. More 
generally, making predictions outside the range of x-values in the original 
sample is termed extrapolation and should be avoided, as the validity of 
the predictions or prediction intervals would be unknown. 


Exercises on Section 6 


Exercise 10 Checkout time


The time taken (y seconds) to deal with a customer at a supermarket 
checkout consists of a basic part plus an amount that depends on the 
number of items bought. Observations of the time taken for each of 30 
customers were recorded, together with the number of items (x) that the 
customer had bought. The following summarises the data: 


$$\sum x = 620, \quad \sum y = 2776, \quad \sum xy = 79\,136,$$

$$\sum x^2 = 19\,378, \quad \sum y^2 = 344\,742, \quad n = 30.$$


(a) Calculate the correlation coefficient between the time at the checkout 
and the number of purchases, based on this information. 


(b) Does the correlation coefficient suggest that a least squares regression 
line would be a good way of representing the relationship between time 
at the checkout and number of purchases? What further information 
would you need in order to answer this question with more confidence? 


Exercise 11 Relating number of purchases to checkout time


Using the data in Exercise 10, estimate the least squares regression line,
y = a + bx.


Interpret the slope and intercept of this line in the context of the time 
taken to serve a customer. 


7 Computer work: binomial and t-test 


In Subsection 3.2, you learned the general form of the binomial 
distribution. In this section, you will explore the shape of this distribution 
for different values of n and p. You will also use Minitab to perform an 
unpaired t-test to compare two population means when it is not assumed 
that the two population variances are equal. You should work through all 
of Chapter 12 of the Computer Book now, if you have not already done so. 


Summary 


In this unit, we have reviewed the main themes of M140: numerical and 
graphical summaries of data, the collection of data, probability, hypothesis 
testing, confidence intervals, correlation and regression. You may have seen 
more clearly the similarities and links between many of the concepts and 
methods that you have learned in the module. You will also have gained 
more practice at using many of these methods. 


You have learned that the probabilities you calculated in Unit 6 for the
sign test are probabilities from a binomial distribution in which $p = \frac{1}{2}$.
You have learned how to calculate binomial probabilities for other values 
of p and to identify situations where the binomial distribution arises. You 
have also used Minitab to calculate binomial probabilities and explored the 
shape of the binomial distribution for different sample sizes and values of p. 


There are a variety of situations in which z-tests and t-tests can be used to 
test whether a population mean has a particular value or if two population 
means are equal. You have learned how to choose the appropriate test for 
the different situations. You have also learned about a two-sample t-test 
that does not assume population variances are equal. You have learned to 
use Minitab to perform the test. 


Learning outcomes 




After working through this unit, you should be able to: 

• interpret a growth chart
• combine different sampling methods in designing a survey
• identify situations where a binomial distribution applies
• calculate probabilities for a binomial distribution, both by hand and
  using Minitab
• choose an appropriate test for making inferences about a population
  mean
• choose an appropriate test for making inferences about the difference
  between two population means
• recognise the similarity of the test statistic for different forms of z-test
  and t-test
• describe the two-sample t-test for comparing two population means
  when the population variances are unequal and perform the test using
  Minitab.


Also, this unit has reviewed many learning objectives from the first
11 units of M140, more than you might have thought. Working through
the unit will have consolidated your ability to:


• find the mean and median of a batch of data
• find the upper and lower quartiles and the interquartile range of a
  batch of data
• calculate the variance and standard deviation of a batch of data
• prepare a five-figure summary of a batch of data
• draw and interpret the boxplot of a batch of data
• draw a stemplot of a batch of data
• use stemplots and boxplots to decide whether a batch of data is
  symmetric, left-skew or right-skew
• appreciate the priorities in summarising a batch of data
• find the weighted mean of a set of numbers with associated weights
• describe the major steps in producing the Retail Prices Index
• calculate a simple chained price index and explain what is meant by
  its base date
• describe quota sampling in general terms
• choose a simple random sample using random numbers and a labelled
  list of the target population
• choose a systematic random sample using random numbers and a
  labelled list of the target population
• describe the relative strengths and weaknesses of simple and
  systematic random sampling
• describe the principles involved in cluster sampling and stratified
  sampling
• choose a random sample for a stratified survey using random numbers
  and a labelled list of the target population
• choose a random sample for cluster sampling using random numbers
  and a labelled list of the target population
• distinguish between three kinds of experiments (exploratory,
  measurement and hypothesis testing)
• appreciate the need to randomise and reduce sources of bias when
  designing an experiment
• explain why placebos are used in clinical trials and the purpose of
  double-blind trials
• describe crossover, matched-pairs and group-comparative designs of
  clinical trials
• calculate probabilities based on random selection
• calculate joint and conditional probabilities
• state and use the relationship between joint and conditional
  probabilities
• express the independence of two events in terms of conditional
  probabilities
• state and use the general addition rule for probabilities
• count the number of ways that a specified combination can occur
• understand the concepts of a hypothesis test and the main steps in
  performing a hypothesis test
• carry out the $\chi^2$ test for contingency tables, taking account of the size
  of Expected values
• interpret a $\chi^2$ test in terms of the null and alternative hypotheses
• decide when to use the $\chi^2$ test for contingency tables
• appreciate that we can think of all normal distributions in terms of
  the standard normal distribution
• apply the formula that transforms any variable x with a given normal
  distribution to the variable z with the standard normal distribution
• appreciate that, whatever the shape of the population distribution, for
  a large enough sample size the sampling distribution of the mean is
  nearly always approximately normal
• write down the mean and standard deviation of the sampling
  distribution of the mean for samples of size n, given the population
  mean, $\mu$, and standard deviation, $\sigma$
• carry out a two-sample z-test to analyse the difference between means
• carry out a matched-pairs t-test
• examine whether it is reasonable to assume that two population
  variances are equal




• carry out a two-sample t-test when population variances are equal
• appreciate the relationship between hypothesis tests and confidence
  intervals
• calculate a confidence interval for a population mean
• interpret a confidence interval
• calculate a confidence interval for the difference between two
  population means
• explain what is meant by a relationship between two variables
• recognise positive and negative relationships from a scatterplot
• describe a relationship between two variables which is neither positive
  nor negative
• recognise strong and weak relationships from a scatterplot
• understand the concept of the correlation coefficient, in particular
  how it relates to a relationship between two variables shown on a
  scatterplot
• calculate the correlation coefficient by hand
• understand the terms response variable and explanatory variable, and
  decide which is which in a given example
• calculate a least squares regression line for a batch of linked data by
  hand
• use a regression line to predict the value of the response variable, and
  know when it is appropriate to do this
• interpret a confidence interval for estimating the mean response using
  a least squares regression line
• interpret a prediction interval for individual predictions from a least
  squares regression line.





Take a bow for reaching the end of the last unit!


Solutions to activities


Solution to Activity 1 


(a) There are 10 data values. The fifth longest time between elections is
49 months and the sixth longest is also 49 months. As
\[ \frac{49 + 49}{2} = 49, \]
the median is 49 months.

(b) Using the formula,
\[ \bar{x} = \frac{44 + 7 + 55 + 49 + 48 + 58 + 61 + 49 + 47 + 60}{10} = \frac{478}{10} = 47.8. \]
So the mean is 47.8 months.


(c) One period between general elections was only 7 months, much 
smaller than the other values. This small value pulls down the mean 
but has little effect on the median. For this reason, the mean is 
smaller than the median. However, the mean and median are quite 
close and both seem reasonably representative of the data. (Though 
you might argue that 7 months is an outlier, and so the median is 
preferable to the mean because it is not affected by the odd outlier.) 
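The comparison of mean and median above can be checked with a short sketch. It uses only the ten values listed in part (b); the variable names are illustrative and the second calculation (dropping the 7-month value) is an extra check, not part of the original solution.

```python
from statistics import mean, median

# Times between general elections in months, as listed in part (b).
months = [44, 7, 55, 49, 48, 58, 61, 49, 47, 60]

print(mean(months))    # arithmetic mean of the ten values
print(median(months))  # average of the 5th and 6th sorted values

# Illustrative extra: leaving out the 7-month value shows how much
# a single small value pulls the mean down, while the median barely moves.
without_short_gap = [m for m in months if m != 7]
print(mean(without_short_gap), median(without_short_gap))
```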


Solution to Activity 2 
(a) There are 15 data values, so
\[ \frac{n+1}{4} = \frac{16}{4} = 4 \quad\text{and}\quad \frac{3(n+1)}{4} = \frac{48}{4} = 12. \]
Hence the lower quartile is the 4th item and the upper quartile is the
12th item (4th from the top). Therefore $Q_1 = 57$ and $Q_3 = 65$. The
interquartile range is the distance between them: 65 − 57 = 8.





(b) Now there are 20 data values, so
\[ \frac{n+1}{4} = \frac{21}{4} = 5\frac{1}{4} \quad\text{and}\quad \frac{3(n+1)}{4} = \frac{63}{4} = 15\frac{3}{4}. \]
Thus the lower quartile is one-quarter of the way from 17 to 19 and
the upper quartile is three-quarters of the way from 23 to 24.
Therefore $Q_1 = 17.5$, $Q_3 = 23.75$ and the interquartile range is
23.75 − 17.5 = 6.25 ≈ 6.
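The position rule used in both parts, (n + 1)/4 and 3(n + 1)/4 with interpolation between neighbouring values, could be sketched as follows. The function names are hypothetical, and the two batches are artificial stand-ins because the activity's data are not reproduced in this section.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional, position in a sorted batch."""
    whole = int(position)            # whole-number part of the position
    fraction = position - whole      # how far to move towards the next value
    lower = sorted_values[whole - 1]
    if fraction == 0:
        return lower
    upper = sorted_values[whole]
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Return (Q1, Q3, interquartile range) using the (n + 1)/4 rule."""
    data = sorted(values)
    n = len(data)
    q1 = value_at_position(data, (n + 1) / 4)
    q3 = value_at_position(data, 3 * (n + 1) / 4)
    return q1, q3, q3 - q1


# Artificial batches (not the activity's data) of sizes 15 and 20.
batch_of_15 = [10 * i for i in range(1, 16)]
batch_of_20 = list(range(1, 21))
print(quartiles(batch_of_15))
print(quartiles(batch_of_20))
```

With 15 values the quartiles fall exactly on the 4th and 12th items; with 20 values they fall a quarter and three-quarters of the way between neighbouring items, as in part (b).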
Solution to Activity 3


The values of $\sum x^2$ and $\sum x$ must be determined. Don’t write down each
number you square; just use your calculator memory:
\[ \sum x^2 = 20.5^2 + 21.6^2 + \cdots + 23.6^2 = 4112.92 \]
and
\[ \sum x = 20.5 + 21.6 + \cdots + 23.6 = 192.0. \]

The sample size is n = 9 and so the variance equals
\[ \frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right) = \frac{1}{8}\left(4112.92 - \frac{192.0^2}{9}\right) = \frac{1}{8}(4112.92 - 4096.0) = 2.115. \]

The standard deviation equals
\[ \sqrt{\text{variance}} = \sqrt{2.115} \approx 1.45. \]
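The arithmetic can be reproduced from the two sums alone. This is a minimal sketch using the values of n, the sum of x and the sum of squares quoted in the solution; the variable names are illustrative.

```python
from math import sqrt

# Summary values quoted in the solution.
n = 9
sum_x = 192.0
sum_x_squared = 4112.92

# Sample variance: divide the corrected sum of squares by n - 1.
variance = (sum_x_squared - sum_x ** 2 / n) / (n - 1)
standard_deviation = sqrt(variance)

print(round(variance, 3), round(standard_deviation, 2))
```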


Solution to Activity 4 


(a) Reading from the figure, 4.5 kg is the 2nd percentile at 10 weeks old,
so 2% of boys weigh less than 4.5 kg at 10 weeks.


(b) 10 kg is the 75th percentile at 46 weeks old, so 25% of boys weigh
more than 10 kg at 46 weeks.


(c) 6 kg is the 2nd percentile at 22 weeks old, and 7.5 kg is the
50th percentile. Hence the proportion of 22-week-old boys weighing
between 6 kg and 7.5 kg is 50% − 2% = 48%.

Solution to Activity 5 

(a) From the solution to Activity 2, $Q_1 = 57$ and $Q_3 = 65$.

There are 15 data values, so the middle value is the 8th. Therefore,
n = 15 and M = 63.

The lowest and highest values are $E_L = 47$ and $E_U = 73$.

Hence the following is the five-figure summary of the data:

                   M = 63
    n = 15   Q1 = 57     Q3 = 65
             EL = 47     EU = 73

(b) This is the boxplot:

[Boxplot of the data on an axis running from 50 to 75, with $E_L$, $Q_1$, M, $Q_3$ and $E_U$ marked]


(c) The median (M) is much nearer the upper quartile ($Q_3$) than the
lower quartile ($Q_1$) and the left-hand whisker is a little longer than
the right-hand whisker. Hence the boxplot shows the data are
left-skew and, from the position of M, the skewness is fairly marked.


Solution to Activity 6 


(a) There is one particularly high value, 53.46 seconds, which is listed 
separately. The stretched stemplot looks like this: 


44 | 3
44 | 5
45 | 0
45 | 6
46 | 1
46 | 6
47 |
47 | 5
48 | 1
48 | 9
HI | 534

n = 24    44 | 3 represents 44.3 seconds


(b) The stemplot shows that the data are clearly right-skew. There are a 
lot of values in the range 44—46.5 seconds, then a few values stretching 
out to 49 seconds, and then the value of 53.4 seconds. 


Solution to Activity 7 
For ‘Food and catering’, 
rw = 1.013 × 163 = 165.119.


Similar calculations for other groups yield the last column of the table.


                                    Price ratio for
                                    August 2013 relative   2013      Price ratio
                                    to January 2013        weights   × weight
Group                               (r)                    (w)       (rw)

Food and catering                   1.013                  163       165.119
Alcohol and tobacco                 1.021                   91        92.911
Housing and household expenditure   1.017                  419       426.123
Personal expenditure                1.055                   83        87.565
Travel and leisure                  1.022                  244       249.368

Then
\[ \sum rw = 165.119 + 92.911 + \cdots + 249.368 = 1021.086 \]
and
\[ \sum w = 163 + 91 + \cdots + 244 = 1000. \]

Thus the all-item price ratio is
\[ \frac{\sum rw}{\sum w} = \frac{1021.086}{1000} \approx 1.021. \]
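The all-item price ratio is a weighted mean of the group price ratios, which can be sketched directly from the table. The dictionary layout and names below are illustrative choices; the numbers are those given in the solution.

```python
# Price ratios (r) and 2013 weights (w) for each group, from the table.
groups = {
    "Food and catering": (1.013, 163),
    "Alcohol and tobacco": (1.021, 91),
    "Housing and household expenditure": (1.017, 419),
    "Personal expenditure": (1.055, 83),
    "Travel and leisure": (1.022, 244),
}

sum_rw = sum(r * w for r, w in groups.values())  # sum of ratio x weight
sum_w = sum(w for _, w in groups.values())       # sum of the weights

all_item_ratio = sum_rw / sum_w
print(round(sum_rw, 3), sum_w, round(all_item_ratio, 3))
```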
Solution to Activity 8


(a) The numbers in each set are:

        A       B       C       D       E       F
        8      15      16       9       7      13

Dividing these by the population size of 68, the following gives the
proportion of the population in each set:

        A       B       C       D       E       F
    0.118   0.221   0.235   0.132   0.103   0.191


The labels selected are: 
68, 42, 66, 14, 60, 63, 24, 06, 21, 52, 37, 19, 33, 55, 03, 11, 18. 


The sample is shown in the following table: 


Label  Initials  Set     Label  Initials  Set     Label  Initials  Set

03     A.P.D.    A       21     G.B.Y.    B       55     M.J.P.    E
06     J.L.      A       24     R.D.M.    C       60     M.A.T.    F
11     C.E.L.    B       33     D.A.S.    C       63     R.J.C.    F
14     J.S.R.    B       37     C.A.M.    C       66     E.D.      F
18     P.J.G.    B       42     A.L.      D       68     M.B.      F
19     W.W.S.    B       52     G.C.T.    E


The numbers selected from each set are:

        A       B       C       D       E       F
        2       5       3       1       2       4


Dividing these by the sample size of 17, the following gives the 
proportion of the sample in each set: 


A B C D E F 
0.118 0.294 0.176 0.059 0.118 0.235 


Comparing these proportions with the proportions in solution (a), B 
and F are over-represented (and E is marginally over-represented), C 
and D are under-represented, while A is spot on. 


As we want to sample a quarter of the population, we will start at 1, 2, 
3 or 4. The selected digit from row 10 is 3, so we start at label 03. 


Every fourth label is in the sample: 
03, 07, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67. 


The sample is shown in the following table: 


Label  Initials  Set     Label  Initials  Set     Label  Initials  Set

03     A.P.D.    A       27     T.P.H.    C       51     J.B.R.    E
07     J.D.H.    A       31     T.R.F.    C       55     M.J.P.    E
11     C.E.L.    B       35     D.L.      C       59     D.B.M.    F
15     S.L.      B       39     W.O.J.    C       63     R.J.C.    F
19     W.W.S.    B       43     L.R.P.    D       67     Y.S.H.    F
23     M.S.      B       47     T.Z.G.    D


The numbers selected from each set are:

        A       B       C       D       E       F
        2       4       4       2       2       3


Dividing these by the sample size of 17, the following gives the 
proportion of the sample in each set: 


A B C D E F 
0.118 0.235 0.235 0.118 0.118 0.176 


The differences between these proportions and those in (a) are small. 
In fact, as each sample size must be a whole number they could not be 
closer. So the sample is a good representation of the population, as far 
as we can tell. 


The population is listed by set — first set A, then set B,.... 
Systematic sampling is picking every fourth person, so it is certain to 
take approximately a quarter of the people in each set. Hence it could 
be anticipated that the systematic sample would represent the 
population better than the simple random sample. 
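The systematic sample and its breakdown by set can be reproduced with a short sketch. It assumes, as the solution states, that labels run consecutively through the sets (so with the set sizes from part (a), set A holds labels 1 to 8, set B holds 9 to 23, and so on); the function and variable names are hypothetical.

```python
# Set sizes from part (a); labels are assumed to run consecutively by set.
set_sizes = {"A": 8, "B": 15, "C": 16, "D": 9, "E": 7, "F": 13}
population_size = sum(set_sizes.values())

start, step = 3, 4  # start at label 03 and take every fourth label
labels = list(range(start, population_size + 1, step))


def set_of(label):
    """Return the set containing a label, given consecutive labelling."""
    upper = 0
    for name, size in set_sizes.items():
        upper += size
        if label <= upper:
            return name
    raise ValueError(f"label {label} is outside the population")


counts = {name: 0 for name in set_sizes}
for label in labels:
    counts[set_of(label)] += 1

print(labels)
print(counts)
print({name: round(c / len(labels), 3) for name, c in counts.items()})
```

Changing `start` to 1, 2 or 4 shows that any of the possible systematic samples takes roughly a quarter of each set, which is the point made in the final paragraph of the solution.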


Solution to Activity 9 


(a) There is no single correct answer to this question. One approach is to
group parishes together so as to form a moderate number of strata,
say about 10 strata. This is because there are too many parishes to 
treat each as an individual stratum. An alternative would be to treat 
the parishes as clusters and randomly sample 10 of them, say. (But 
forming clusters would not utilise knowledge about parishes regarding 
their geographic location, proportion of council houses, or age of 
houses in different areas.) Each stratum should consist of adjacent 
parishes that have a similar housing stock. 


Houses within a stratum should be grouped by their council tax band 
and by whether or not they are privately owned. (Some council tax 
bands might be combined to reduce the number of strata.) Then 
streets within each of these strata should be treated as clusters and a 
random subset of them selected. It is essential to treat streets as 
clusters so as to reduce the travelling time of the interviewer. 


A large number of houses should be selected (perhaps 50%) from each 
street, as only a small proportion will have any disabled people living 
in them. Most of the interviews will consequently be very fast, so 
walking between interviews should be minimised.

(b) The above procedure is likely to yield a sample that is better than a
completely random sample at reflecting the characteristics of the 
unitary authority’s housing stock. This should also hold for 
alternative sampling schemes that you may have proposed. Hence the 
procedure should reduce sampling variation compared with that of 
simple random sampling. The main benefit though, is that travelling 
time should be far less because only a modest number of streets (the 
clusters) will feature in the survey, so the survey should cost much less 
than with a simple random sample. 


Solution to Activity 10 


Statement G is true: in a double-blind trial, neither the patients nor the 
doctors know which patients have received the drug being tested and 
which the placebo, while an appropriate independent person does have 
this information. Consequently, statements A and D are false. 


It has been shown that patients can genuinely respond to the process of 
being treated, even when the treatment contains no beneficial ingredient. 
Similarly, doctors can influence a patient’s response in ways other than 
through the medication. Hence statement E is true, while B and C are 
false. 


Patients must give informed consent before they are entered into a clinical 
trial. Thus they must always be told that they might be given just a 
placebo. Indeed, it would be unethical not to tell them that. Thus 
statement F is false. 


Solution to Activity 11 


(a) It is a group-comparative design because there are two groups, 
patients were allocated to the experimental group or control group at 
random and the design had no other features or additional complexity. 


(b) The validity of the trial is questionable because any relevant gender 
differences would bias results. For example, if men tended to have 
more extreme arthritis symptoms than women, then the experimental 
treatment (taken by the women) would only seem poorer than the 
existing drug (taken by the men) if it was actually much worse than 
the existing drug. 


(c) If the experiment is repeated, half the women should be allocated at
random to the experimental group and half to the control group, using
a simple random sample to choose which women are allocated to the
experimental group. Similarly, half the men should be picked at
random from the male patients and allocated to the experimental
group and the other half of the men should be allocated to the control
group. This meets the requirements of randomisation (so neither drug
is deliberately favoured) and ensures the same gender ratio in the two
groups. Essentially, this procedure treats women as one stratum and
men as a second stratum; it is called stratified randomisation.
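The allocation described in part (c) might be sketched as below. The patient identifiers, group sizes and the function name are all made up for illustration; the only idea taken from the solution is that each gender is split at random, half to each group.

```python
import random


def stratified_randomise(patients_by_stratum, seed=None):
    """Split each stratum at random, half to each treatment group."""
    rng = random.Random(seed)
    experimental, control = [], []
    for patients in patients_by_stratum.values():
        shuffled = list(patients)
        rng.shuffle(shuffled)          # random order within the stratum
        half = len(shuffled) // 2
        experimental.extend(shuffled[:half])
        control.extend(shuffled[half:])
    return experimental, control


# Hypothetical patient identifiers: 10 women and 8 men.
patients = {
    "women": [f"W{i}" for i in range(1, 11)],
    "men": [f"M{i}" for i in range(1, 9)],
}
experimental, control = stratified_randomise(patients, seed=1)
print(sorted(experimental))
print(sorted(control))
```

Whatever seed is used, each group receives five women and four men, so the gender ratio is the same in both groups.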


Solution to Activity 12 


(a) There are 227 research fellows in the population of 794 academic
statisticians. Hence, the probability that the selected person is a
research fellow is
\[ \frac{227}{794} \approx 0.286. \]

(b) Among the 794 academic statisticians, there are 275 aged 30-39.
Hence, the probability that the selected person is aged 30-39 is
\[ \frac{275}{794} \approx 0.346. \]

(c) Among the 794 academic statisticians, there are 82 research fellows
aged 30-39. Hence, the probability that the selected person is a
research fellow aged 30-39 is
\[ \frac{82}{794} \approx 0.103. \]


Solution to Activity 13 


(a) The subpopulation of interest are the 211 people in Table 4 who are
aged 40-49, of whom 70 are senior lecturers. Hence,
\[ P(\text{senior lecturer} \mid \text{aged 40-49}) = \frac{70}{211} \approx 0.332. \]

(b) Now the subpopulation of interest are the 156 people in Table 4 who
are senior lecturers. Of these people, 70 are aged 40-49. Hence,
\[ P(\text{aged 40-49} \mid \text{senior lecturer}) = \frac{70}{156} \approx 0.449. \]


Solution to Activity 14 


(a) There are 66 professors aged 50-59. As there are 794 academic
statisticians,
\[ P(\text{professor aged 50-59}) = \frac{66}{794} \approx 0.083. \]

(b) There are 172 professors in total. Hence,
\[ P(\text{professor}) = \frac{172}{794} \; (\approx 0.217). \]

(c) The subpopulation of interest are the 172 people in Table 4 who are
professors, of whom 66 are aged 50-59. Hence,
\[ P(\text{aged 50-59} \mid \text{professor}) = \frac{66}{172} \; (\approx 0.384). \]

(d) From (a),
\[ P(A \text{ and } B) = P(\text{professor aged 50-59}) \approx 0.083. \]
Also, from (b) and (c),
\[ P(A) = P(\text{professor}) = \frac{172}{794} \]
and
\[ P(B \mid A) = P(\text{aged 50-59} \mid \text{professor}) = \frac{66}{172}. \]
So
\[ P(A) \times P(B \mid A) = \frac{172}{794} \times \frac{66}{172} = \frac{66}{794} \approx 0.083 = P(A \text{ and } B). \]
(In decimals, $P(A) \times P(B \mid A) \approx 0.217 \times 0.384 \approx 0.083$.)
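The relationship between joint and conditional probabilities in part (d) holds exactly, which can be confirmed with exact fractions. This is a small sketch using the three counts quoted in the solution; the variable names are illustrative.

```python
from fractions import Fraction

# Counts quoted in the solution.
total = 794
professors = 172
professors_aged_50_59 = 66

p_a = Fraction(professors, total)                          # P(A)
p_b_given_a = Fraction(professors_aged_50_59, professors)  # P(B|A)
p_a_and_b = Fraction(professors_aged_50_59, total)         # P(A and B)

# The product P(A) x P(B|A) reproduces P(A and B) with no rounding error.
print(p_a * p_b_given_a == p_a_and_b)
print(round(float(p_a_and_b), 3))
```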


Solution to Activity 15 


(a) From Table 4, there are 172 professors. As 7 + 6 + 8 = 21, there are
21 academic statisticians aged at least 60 who are not professors.
Hence, adding 172 to 21 gives a total of 193 UK academic statisticians 
who will be invited. If we simply added the number of professors (172) 
to the number of academic statisticians aged 60 or over (68), then we 
would double-count the 47 people who are both professors and aged 
60 or over. Another way of obtaining the figure of 193 is from 
172 + 68 — 47 = 193. 


(b) There are 47 UK academic statisticians who are both professors and 
aged 60 or over. So 47 people would be invited. 
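(c) From (a),
\[ P(A \text{ or } B) = \frac{193}{794} \approx 0.243 \]
and from (b),
\[ P(A \text{ and } B) = \frac{47}{794} \approx 0.059. \]

(d) Here
\[ P(A) = \frac{172}{794} \approx 0.217 \quad\text{and}\quad P(B) = \frac{68}{794} \approx 0.086. \]
Hence,
\[ P(A) + P(B) - P(A \text{ and } B) \approx 0.217 + 0.086 - 0.059 = 0.244 \approx P(A \text{ or } B). \]

Alternatively, to avoid having to deal with rounding you could put
\[ P(A) + P(B) - P(A \text{ and } B) = \frac{172}{794} + \frac{68}{794} - \frac{47}{794} = \frac{172 + 68 - 47}{794} = \frac{193}{794} = P(A \text{ or } B). \]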
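The general addition rule can likewise be checked without rounding. The sketch below uses the counts quoted in the solution and exact fractions; the names are illustrative.

```python
from fractions import Fraction

total = 794
professors = 172           # event A
aged_60_or_over = 68       # event B
both = 47                  # A and B
either = 193               # A or B, counted directly in part (a)

p_a = Fraction(professors, total)
p_b = Fraction(aged_60_or_over, total)
p_a_and_b = Fraction(both, total)

# General addition rule: subtracting P(A and B) removes the double count.
p_a_or_b = p_a + p_b - p_a_and_b
print(p_a_or_b == Fraction(either, total), round(float(p_a_or_b), 3))
```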


Solution to Activity 16 


(a) The number of ways of choosing three pairs from six (order does not
matter) is
\[ {}^6C_3 = \frac{6 \times 5 \times 4}{3 \times 2 \times 1} = 20. \]

(b) The number of ways of choosing four pairs from nine is
\[ {}^9C_4 = \frac{9 \times 8 \times 7 \times 6}{4 \times 3 \times 2 \times 1} = 126. \]


Solution to Activity 17 

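(a) \[ P(FSSSFS) = 0.4 \times 0.6 \times 0.6 \times 0.6 \times 0.4 \times 0.6 = 0.6^4 \times 0.4^2 = 0.1296 \times 0.16 = 0.020\,736. \]

(b) The number of sequences of 6 days that give 4 successes is
\[ {}^6C_4 = \frac{6 \times 5 \times 4 \times 3}{4 \times 3 \times 2 \times 1} = 15. \]

(c) \[ P(\text{4 successful days in 6 days}) = {}^6C_4 \times 0.6^4 \times 0.4^2 = 15 \times 0.020\,736 = 0.311\,04 \approx 0.311. \]

(d) \[ P(\text{5 successful days in 8 days}) = {}^8C_5 \times 0.6^5 \times 0.4^3 = \frac{8 \times 7 \times 6 \times 5 \times 4}{5 \times 4 \times 3 \times 2 \times 1} \times 0.6^5 \times 0.4^3 = 56 \times 0.077\,76 \times 0.064 \approx 0.279. \]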


Solution to Activity 18 


Define ‘blood group O’ as success, so that p = P(S) = 0.44 and
q = 1 − p = 0.56. The number of trials (number of people) is n = 7 and we
want P(x = 2). Using the formula for the binomial distribution,
\[ P(x = 2) = {}^7C_2 \times 0.44^2 \times 0.56^{7-2} = {}^7C_2 \times 0.44^2 \times 0.56^5 \]
\[ \approx \frac{7 \times 6}{2 \times 1} \times 0.1936 \times 0.055\,073 = 21 \times 0.010\,662 \approx 0.224. \]
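The binomial formula used here and in Activity 17 can be written as a small function. This sketch relies on `math.comb` from the Python standard library (Python 3.8 or later); the function name is an illustrative choice.

```python
from math import comb


def binomial_probability(n, x, p):
    """P(x successes in n independent trials, each with success probability p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Activity 18: 2 people with blood group O among 7, with p = 0.44.
print(round(binomial_probability(7, 2, 0.44), 3))

# Activity 17(c) and (d): 4 successes in 6 days and 5 in 8 days, with p = 0.6.
print(round(binomial_probability(6, 4, 0.6), 3))
print(round(binomial_probability(8, 5, 0.6), 3))
```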





Solution to Activity 19 


One reason for not using $\sum (\text{Residual})^2$ as the test statistic is that a
residual’s evidence against $H_0$ depends not only on the size of the
Residual, but also on the size of the Expected value. To illustrate, suppose
that a cell is expected to contain 3400 items but in fact contains 3420.
The difference is relatively small and readily explained as random variation.
On the other hand, if a cell is expected to contain 10 items but in fact
contains 30, then the difference is large. Thus, although both cases give a
$(\text{Residual})^2$ of $20^2 = 400$, only the latter provides strong evidence that the
Expected value is wrong.


Another reason (which you will not be aware of) is that we do not know
the probability distribution of $\sum (\text{Residual})^2$. As noted earlier (point 3 in
the description of a hypothesis test), the probability distribution of a
suitable test statistic must be fully known if $H_0$ is true.


Solution to Activity 20 
(a) When $\mu = 0.38$ and $\sigma = 0.03$, the formula
\[ z = \frac{x - \mu}{\sigma} \quad\text{gives}\quad z = \frac{x - 0.38}{0.03}. \]
When x = 0.40,
\[ z = \frac{0.40 - 0.38}{0.03} = \frac{0.02}{0.03} \approx 0.667. \]
So a shell thickness of 0.40 mm is 0.667 standard deviations above the
mean thickness of 0.38 mm.

(b) When x = 0.30,
\[ z = \frac{0.30 - 0.38}{0.03} = \frac{-0.08}{0.03} \approx -2.667. \]
So a shell thickness of 0.30 mm is 2.667 standard deviations below the
mean thickness of 0.38 mm.
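The transformation to the standard normal variable is a one-line calculation. The sketch below applies it to the two shell thicknesses in the solution; the function name is illustrative.

```python
def z_score(x, mean, standard_deviation):
    """Number of standard deviations by which x lies above the mean."""
    return (x - mean) / standard_deviation


# Shell thickness in mm, with mean 0.38 and standard deviation 0.03.
print(round(z_score(0.40, 0.38, 0.03), 3))
print(round(z_score(0.30, 0.38, 0.03), 3))
```

A negative result means the value lies below the mean, as in part (b).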


Solution to Activity 21 


(a) The sampling distribution of the mean mark for a sample of size 5 has
mean 66 and standard deviation $\sigma/\sqrt{n} = 22/\sqrt{5} \approx 9.84$.


The sampling distribution of the mean mark for a sample of size 15
also has mean 66, but its standard deviation is $22/\sqrt{15} \approx 5.68$.


(b) The sample sizes of 5 and 15 are not large, so the distributions of the 
sample mean may differ a little from a normal distribution — more so 
with the smaller sample size of 5. 
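The standard deviation of the sampling distribution of the mean shrinks as the sample size grows, which a short sketch makes concrete. It uses the population standard deviation of 22 from the solution; the function name is illustrative.

```python
from math import sqrt


def standard_deviation_of_mean(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)


for n in (5, 15):
    print(n, round(standard_deviation_of_mean(22, n), 2))
```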


Solution to Activity 22 
(a) The variance is known so, from the flow chart in Figure 9, use the
z-test.


(b) The variance is unknown and the sample size is greater than 25 so, 
from Figure 9, use the z-test. 


(c) The variance is unknown and the sample size is less than 25 so, from 
Figure 9, use the t-test. 


(d) The variance is known so, from Figure 9, use the z-test. 


Solution to Activity 23 


(a) The data are not matched pairs, the population variances are 
unknown, and both sample sizes are above 25 so, from the flow chart 
in Figure 10, use the two-sample z-test. 


(b) The data are not matched pairs, the population variances are 
unknown, one sample size is less than 25, but we can assume the 
population variances are equal (9.6/8.4 ~ 1.14 < 3) so, from 
Figure 10, use the two-sample t-test with a pooled sample variance. 

(c) The data are matched pairs so, from Figure 10, use the matched-pairs 
t-test. 

(d) The data are not matched pairs, the population variances are 
unknown, one sample size is less than 25 (in fact, both are less than 
25), and we cannot assume the population variances are equal 
(12.1/3.5 ~ 3.46 > 3) so, from Figure 10, use the two-sample t-test 
with unequal variances. 


(e) The data are not matched pairs and the population variances are 
known so, from the flow chart in Figure 10, use the two-sample z-test. 
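The decisions in parts (a) to (e) follow a fixed order of questions, which could be sketched as a function. This is an interpretation of the reasoning in the solution, not a reproduction of Figure 10: it assumes the two numbers compared in parts (b) and (d) are sample variances, that "above 25" means strictly greater than 25, and that a ratio below 3 allows equal variances to be assumed. The function name and the sample sizes in the examples are hypothetical.

```python
def choose_two_sample_test(matched_pairs, variances_known, n1, n2,
                           sample_var1=None, sample_var2=None):
    """Pick a test for the difference between two population means."""
    if matched_pairs:
        return "matched-pairs t-test"
    if variances_known:
        return "two-sample z-test"
    if n1 > 25 and n2 > 25:          # assumption: strictly above 25
        return "two-sample z-test"
    # Compare the larger sample variance with the smaller one.
    ratio = max(sample_var1, sample_var2) / min(sample_var1, sample_var2)
    if ratio < 3:                    # assumption: threshold of 3, as in (b) and (d)
        return "two-sample t-test with pooled variance"
    return "two-sample t-test with unequal variances"


# Variance values from parts (b) and (d); the sample sizes are made up.
print(choose_two_sample_test(False, False, 20, 30, 9.6, 8.4))
print(choose_two_sample_test(False, False, 20, 18, 12.1, 3.5))
```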
Solution to Activity 24 


Let $\mu_A$ and $\mu_B$ denote the population mean reductions in cholesterol level
on the oat and bean diets (diets A and B), respectively. Then
\[ n_A = 29, \quad \bar{x}_A = 58.3, \quad s_A = 19.2, \]
\[ n_B = 28, \quad \bar{x}_B = 46.4, \quad s_B = 16.5. \]

The data are not paired, the population variances are not known, and
both sample sizes are above 25. From Figure 10, a two-sample z-test is
appropriate.

The hypotheses are:
\[ H_0: \mu_A = \mu_B \quad\text{and}\quad H_1: \mu_A \neq \mu_B. \]

We first calculate the value of ESE, the estimated standard error of
$\bar{x}_A - \bar{x}_B$:
\[ \text{ESE} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} = \sqrt{\frac{19.2^2}{29} + \frac{16.5^2}{28}} \approx 4.7366. \]

Hence the value of the test statistic is
\[ z = \frac{\bar{x}_A - \bar{x}_B}{\text{ESE}} = \frac{58.3 - 46.4}{4.7366} \approx 2.512. \]

The critical values are ±1.96 (at 5%) and ±2.58 (at 1%). Since
1.96 < 2.512 < 2.58, we can reject $H_0$ at the 5% significance level but not
at the 1% significance level. There is moderate evidence that the two diets
differ in the average reduction in serum cholesterol level that they yield.
The reduction on the oat diet (diet A) appears to be greater.

The only assumptions needed for this hypothesis test are that observations
are all random and independent. The sample sizes are quite large, so the
central limit theorem implies that the distribution of $\bar{x}_A$ will be
approximately normal, as will that of $\bar{x}_B$.


Solution to Activity 25 


(a) The point estimate is $\bar{x}_A - \bar{x}_B = 58.3 - 46.4 = 11.9$. From the solution
to Activity 24, ESE ≈ 4.7366. A two-sample z-test was used in
Activity 24 so z-values will be the critical values.

For a 95% confidence interval, the z-value is 1.96. Hence the lower
limit of the 95% confidence interval is
\[ 11.9 - 1.96 \times 4.7366 \approx 2.62 \]
and the upper limit is
\[ 11.9 + 1.96 \times 4.7366 \approx 21.18. \]
Thus the 95% confidence interval for $\mu_A - \mu_B$ is
(2.62 mg/dl, 21.18 mg/dl).

(b) For a 99% confidence interval, the z-value is 2.58. Hence the lower
limit of the 99% confidence interval is
\[ 11.9 - 2.58 \times 4.7366 \approx -0.32 \]
and the upper limit is
\[ 11.9 + 2.58 \times 4.7366 \approx 24.12. \]
Thus the 99% confidence interval for $\mu_A - \mu_B$ is
(−0.32 mg/dl, 24.12 mg/dl).

(c) A 95% confidence interval is always shorter than the corresponding
99% interval, as in this example.

(d) The 99% confidence interval contains 0 while the 95% confidence
interval does not contain 0. Hence the hypothesis $H_0: \mu_A - \mu_B = 0$
would not be rejected at the 1% significance level but it would be
rejected at the 5% significance level. This was the result found in
Activity 24.
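The test statistic from Activity 24 and both confidence intervals can be reproduced together, which also shows the link between them described in part (d). This sketch uses the summary statistics and the critical values 1.96 and 2.58 from the solutions; the variable and function names are illustrative.

```python
from math import sqrt

# Summary statistics for diets A (oat) and B (bean).
n_a, mean_a, sd_a = 29, 58.3, 19.2
n_b, mean_b, sd_b = 28, 46.4, 16.5

# Estimated standard error of the difference between the sample means.
ese = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
difference = mean_a - mean_b
z = difference / ese


def confidence_interval(critical_value):
    """Interval for the difference between the two population means."""
    margin = critical_value * ese
    return difference - margin, difference + margin


ci_95 = confidence_interval(1.96)
ci_99 = confidence_interval(2.58)

print(round(ese, 4), round(z, 3))
print([round(limit, 2) for limit in ci_95])
print([round(limit, 2) for limit in ci_99])

# An interval excludes 0 exactly when |z| exceeds its critical value.
print(abs(z) > 1.96, abs(z) > 2.58)
```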


Solution to Activity 26 


In (a), the points all lie very close to a straight line sloping downwards, so 
there is a strong negative linear relationship between the two variables. 


In (b), the points follow a general upward trend, but a smooth line could 
not be drawn through the points that lay close to many of them. Thus 
there is a weak positive relationship between the two variables — it may be 
linear, but that is not clear. 


In (c), the scatterplot suggests no relationship between the two variables. 


In (d), the points follow a clear downward trend, but most of them would 
be some way from a straight line drawn through them. Thus there is a 
negative relationship between the two variables that looks linear, but is 
not very strong. 


In (e), the points all lie very close to a straight line sloping upwards, so 
there is a strong positive linear relationship between the two variables. 


In (f), the points all lie very close to a line that increases from left to 
right, so there is a strong positive relationship between the two variables. 
However, the points clearly follow a curve (not a straight line) so the 
relationship is non-linear. 


Solution to Activity 27 

From the scatterplots: 

• A strong positive linear relationship is shown in (e).
• A strong positive non-linear relationship is shown in (f).
• A weak positive relationship is shown in (b).
• No relationship is shown in (c).
• A weak negative relationship is shown in (d).
• A strong negative linear relationship is shown in (a).


From the figures it is not obvious whether (b) will give a higher
correlation coefficient than (f), or vice versa. In fact, (f) has the higher
correlation coefficient. The correct ordering is as follows. Correlation
coefficients are also given. (You do not have the information to calculate
the correlations.)


(e): 0.99, (f): 0.88, (b): 0.77, (c): 0.13, (d): −0.82, (a): −0.99.


Solutions to exercises


Solution to Exercise 1 


(a) 


We start in row 36 of the random number table. 

For the first stratum, we want six numbers between 01 and 23: 
18, 17, 19, 08, 01, 16. 

Rearranged in order, the labels are: 
01, 08, 16, 17, 18, 19. 


For the second stratum, we want six numbers between 24 and 48. 
Starting in the random number table at the point where we left off, we 
get: 


31, 42, 25, 37, 35, 44. 
Rearranged in order, the labels are: 
25, 31, 35, 37, 42, 44. 


For the third stratum, we want five numbers between 49 and 68. 
Continuing from where we left off in the random number table, we get: 


61, 55, 64, 68, 56. 
Rearranged in order, the labels are: 

55, 56, 61, 64, 68. 
The first two single digits between 1 and 6 in row 95 are 4 and 6, 
which correspond to set D and set F. 


Set D is of size 9, so we select 9/3 = 3 people from set D. Set F is of 
size 13 and 13/3 rounds up to 5, so we select five people from set F. 


Starting in row 72, for set D we want three random number pairs 
between 40 and 48: 


42, 48, 46. 
Continuing on, for set F we want five numbers between 56 and 68: 
61, 59, 68, 65, 60. 


Hence, the following are the people in the subsamples (rearranged in 
the order of their labels): 

• from set D: A.L., A.T. and Y.H.
• from set F: D.B.M., M.A.T., A.N.D., G.K.S. and M.B.


Solution to Exercise 2 


(a) 


(b) 


Age-bands should be formed (perhaps under 25, 25-34, 35-44, 45-54, 
55 or over), and job grades should be grouped together (perhaps 
forming five or six groups) so that similar grades are in the same 
group. 

Employees should be listed so that those who share the same age-band 
and grade-group are listed consecutively. Each age-band/grade-group 
combination is a stratum. 


A sample of 500 from 6000 means that a 12th of the workforce is to be 
included in the sample. Generate a random number between 1 and 12 
(either using a computer or random number tables). Pick out every 
12th person from the list, starting with the person given by the 
random number. This gives a systematic sample and the proportion of 
people chosen from each stratum will be approximately the same for 
each stratum. 





Non-sampling errors could arise from refusal to answer, employee 
absence, out-of-date data, incorrect (or incorrectly transcribed) data, 
etc. 


Solution to Exercise 3 


(a) 


For a crossover design, a random half of the women should take the 
new treatment for four weeks. They should then switch to the old 
treatment. The difference of each patient’s symptoms under the two 
treatments should be recorded. The other half of the women should 
start on the old treatment for four weeks and then switch to the new 
treatment. The difference in response between new and old treatments 
is again the quantity of interest. The same should be done with the 
men. This design means that order of treatment should not bias 
results, and gender-bias is also controlled. A reasonable time on each 
treatment should elapse before symptoms are measured, so that they 
reflect the current treatment. 


For a matched-pairs design, pairs of patients are required. The 
patients within a pair should be similar with respect to gender, age 
and severity of illness. Then one person in the pair would be picked at 
random and allocated to the experimental treatment and the other 
person in the pair would receive the control treatment. The difference 
in their symptoms would be the data analysed. 


Finding matched pairs is hard, while a crossover trial is 
straightforward to run because arthritis is a chronic condition. Hence 
the crossover design is more suitable than a matched-pairs design. 
When they are suitable, crossover designs are better than 
group-comparative designs because they remove much of the variation 
between individuals as each individual is compared with himself or 
herself. Hence a group-comparative design would not be better.


Solution to Exercise 4

(a) Define events A and B as follows.

A: a randomly picked academic statistician is aged 40-49.

B: a randomly picked academic statistician is a lecturer.

Then
\[ P(A) = \frac{211}{794} \approx 0.266 \quad\text{and}\quad P(A \mid B) = \frac{59}{239} \approx 0.247. \]
As P(A) does not equal P(A|B), events A and B are not independent.

(b) As
\[ P(B) = \frac{239}{794} \approx 0.301 \quad\text{and}\quad P(A \mid B) \approx 0.247, \]
then
\[ P(A \text{ and } B) \approx 0.301 \times 0.247 \approx 0.074. \]

(c) Using the answers from (a) and (b),
\[ P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \approx 0.266 + 0.301 - 0.074 = 0.493. \]
Solution to Exercise 5 


(a) The number of successes follows a binomial distribution. As p = 0.2,
q = 1 − p = 0.8.

Here n = 4, so
\[ P(x = 1) = {}^4C_1 \times 0.2^1 \times 0.8^{4-1} = \frac{4}{1} \times 0.2 \times 0.8^3 = 0.4096 \approx 0.410. \]

(b) Now n = 5, so
\[ P(x = 2) = {}^5C_2 \times 0.2^2 \times 0.8^{5-2} = \frac{5 \times 4}{2 \times 1} \times 0.2^2 \times 0.8^3 = 0.2048 \approx 0.205. \]

(c) Now n = 7, so
\[ P(x = 0) = {}^7C_0 \times 0.2^0 \times 0.8^{7-0} = 0.8^7 \approx 0.210, \]
as ${}^7C_0 = 1$ and $0.2^0 = 1$.


Solution to Exercise 6


(a) It must be assumed that whether one person feels better is
independent of whether other people feel better. Then probabilities
come from a binomial distribution with n = 6, p = 0.6 and
q = 1 − p = 0.4.
\[ P(x = 2) = {}^6C_2 \times 0.6^2 \times 0.4^4 = \frac{6 \times 5}{2 \times 1} \times 0.36 \times 0.0256 = 0.138\,24. \]

(b) Using the same assumptions as in (a),
\[ P(x = 1) = {}^6C_1 \times 0.6^1 \times 0.4^5 = \frac{6}{1} \times 0.6 \times 0.010\,24 \approx 0.036\,86 \]
and
\[ P(x = 0) = {}^6C_0 \times 0.6^0 \times 0.4^6 = 0.4^6 \approx 0.004\,10. \]
Thus P(2 or fewer people feeling better) is
\[ P(x = 2) + P(x = 1) + P(x = 0) \approx 0.138\,24 + 0.036\,86 + 0.004\,10 \approx 0.179. \]
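The probability of "2 or fewer" is a sum of three binomial terms, so it can be sketched as a loop over x = 0, 1, 2. The function name is illustrative and `math.comb` comes from the Python standard library (Python 3.8 or later).

```python
from math import comb


def binomial_probability(n, x, p):
    """P(x successes in n independent trials, each with success probability p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 6, 0.6
terms = {x: binomial_probability(n, x, p) for x in range(3)}  # x = 0, 1, 2
at_most_two = sum(terms.values())

for x, probability in terms.items():
    print(x, round(probability, 5))
print(round(at_most_two, 3))
```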


Solution to Exercise 7 


(a) The hypotheses are as follows:

$H_0$: locality and concern about air pollution are independent.

$H_1$: locality and concern about air pollution are not independent.


(b) Copy marginal totals from the Observed table to the Expected table.

Then the Expected value for the first cell is:
\[ \frac{\text{Row total} \times \text{Column total}}{\text{Overall total}} = \frac{100 \times 83}{400} = 20.75. \]


The other values are obtained in the same manner, leading to the 
following Expected table: 


Locality Yes No Total 


A 20.75 79.25 100 
B 20.75 79.25 100 
C 20.75 79.25 100 
D 20.75 79.25 100 


Total 83 317 400 
(e) 


As a check on your calculations, the Expected values within the table 
must add to the marginal totals (apart from unimportant rounding 
errors). 
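
The Expected table depends only on the marginal totals, so it can be generated, and the check described above carried out, in a few lines. This is an illustrative sketch using the totals from the Expected table; the variable names are our own.

```python
# Marginal totals copied from the Expected table.
row_totals = {"A": 100, "B": 100, "C": 100, "D": 100}
col_totals = {"Yes": 83, "No": 317}
overall = sum(row_totals.values())  # 400

# Expected value = row total x column total / overall total.
expected = {
    loc: {answer: r * c / overall for answer, c in col_totals.items()}
    for loc, r in row_totals.items()
}

# Check: Expected values must add back to the marginal totals.
for loc, r in row_totals.items():
    assert abs(sum(expected[loc].values()) - r) < 1e-9
for answer, c in col_totals.items():
    assert abs(sum(expected[loc][answer] for loc in row_totals) - c) < 1e-9

print(expected["A"])
```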


The Residual table is found by subtracting the terms of the Expected 
table from those of the Observed table. For the first cell, we have 


$$25 - 20.75 = 4.25.$$
The Residual table is 

Locality   Yes      No 
A           4.25   -4.25 
B          -4.75    4.75 
C          -8.75    8.75 
D           9.25   -9.25 


The $\chi^2$ contribution of the first cell is 
$$\frac{(\text{Residual})^2}{\text{Expected}} = \frac{4.25^2}{20.75} \approx 0.8705.$$





The complete table of $\chi^2$ contributions is: 


Locality Yes No 


A 0.8705 0.2279 
B 1.0873 0.2847 
C 3.6898 0.9661 
D 4.1235 1.0797 


The $\chi^2$ test statistic is the sum of the eight $\chi^2$ contributions: 
$$\chi^2 = 0.8705 + 0.2279 + 1.0873 + 0.2847 + 3.6898 + 0.9661 + 4.1235 + 1.0797 \approx 12.330.$$


The Observed table is a 4 × 2 table, so its degrees of freedom are 
(4 - 1) × (2 - 1) = 3. Hence CV5 = 7.815 and CV1 = 11.345. 


Since 12.330 > 11.345, we reject the null hypothesis at the 
1% significance level. Thus there is strong evidence that concern in a 
household about air pollution varies with locality. 
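
The whole calculation, from Observed counts to the decision, can be sketched as follows. The Observed counts are not listed in this solution, so here they are reconstructed as Expected plus Residual from the tables above; treat them as an assumption. The critical values are those quoted in the solution.

```python
# Observed (Yes, No) counts per locality, reconstructed as
# Expected + Residual; they are not copied from the exercise itself.
observed = {"A": (25, 75), "B": (16, 84), "C": (12, 88), "D": (30, 70)}

row_totals = {loc: sum(counts) for loc, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
overall = sum(row_totals.values())

# Sum the (Residual^2 / Expected) contributions over all eight cells.
chi_squared = 0.0
for loc, counts in observed.items():
    for j, obs in enumerate(counts):
        exp = row_totals[loc] * col_totals[j] / overall
        chi_squared += (obs - exp) ** 2 / exp

# Degrees of freedom: (rows - 1) x (columns - 1).
df = (len(observed) - 1) * (len(col_totals) - 1)

cv5, cv1 = 7.815, 11.345  # critical values for 3 df, as quoted above
reject_at_1_percent = chi_squared > cv1

print(round(chi_squared, 3), df, reject_at_1_percent)
```

Without intermediate rounding the statistic is about 12.329 rather than 12.330, because the solution sums contributions that were each rounded to four decimal places. The conclusion is unchanged.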


Solution to Exercise 8 


(a) The variance is known so, from the flow chart in Figure 9 
(Subsection 5.2), use the z-test. 


(b) The variance is unknown and the sample size is less than 25 so, from 
Figure 9, use the t-test. 


(c) The variance is unknown and the sample size is greater than 25 so, 
from Figure 9, use the z-test. 


(d) The variance is known so, from Figure 9, use the z-test. 
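
The reasoning used in these four answers can be written as a small decision function. This is only a sketch of the rules as they are applied in this solution, not a reproduction of Figure 9. The function name is hypothetical, and treating a sample size of exactly 25 as "large" is an assumption, since the answers above only mention sizes below and above 25.

```python
def choose_one_sample_test(variance_known, n):
    """Pick a one-sample test for a population mean.

    Rules taken from the answers above: a known variance means a
    z-test; an unknown variance means a t-test for small samples
    and a z-test for large ones.
    """
    if variance_known:
        return "z-test"
    # Assumption: n = 25 is treated as large enough for the z-test.
    return "t-test" if n < 25 else "z-test"


# Hypothetical sample sizes, chosen only to illustrate each branch.
print(choose_one_sample_test(variance_known=True, n=12))
print(choose_one_sample_test(variance_known=False, n=12))
print(choose_one_sample_test(variance_known=False, n=40))
```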
Solution to Exercise 9 


(a) The data are not matched pairs, the population variances are 
unknown, one sample size is less than 25 (in fact, both are less than 
25), but we can assume the population variances are equal 
(28.2/23.8 ≈ 1.18 < 3) so, from the flow chart in Figure 10, use the 
two-sample t-test with a pooled sample variance. 


(b) The data are matched pairs so, from Figure 10, use the matched-pairs 
t-test. 


(c) The data are not matched pairs, the population variances are 
unknown, one sample size is less than 25, and we cannot assume the 
population variances are equal (9.1/1.7 ≈ 5.35 > 3) so, from 
Figure 10, use the two-sample t-test with unequal variances. 


(d) The data are not matched pairs, the population variances are 
unknown, and both sample sizes are above 25 so, from Figure 10, use 
the two-sample z-test. 


(e) The data are not matched pairs, the population variances are 
unknown, one sample size is less than 25, and we can assume the 
population variances are equal (17.4/10.6 ≈ 1.64 < 3) so, from 
Figure 10, use the two-sample t-test with a pooled sample variance. 
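
The choices made in parts (a) to (e) follow a fixed sequence of questions, which can be sketched in code. This is an illustration of the reasoning in this solution rather than a copy of Figure 10. The function name and the sample sizes in the usage lines are hypothetical. The sketch assumes that the quoted figures (such as 28.2 and 23.8) are sample variances, that the population variances are unknown, and that a size of exactly 25 counts as large.

```python
def choose_two_sample_test(matched_pairs, n1, n2, var1, var2):
    """Pick a test for the difference between two population means,
    assuming the population variances are unknown."""
    if matched_pairs:
        return "matched-pairs t-test"
    # Assumption: a sample size of exactly 25 is treated as large.
    if min(n1, n2) >= 25:
        return "two-sample z-test"
    # Rule of thumb used above: larger variance / smaller variance < 3.
    ratio = max(var1, var2) / min(var1, var2)
    if ratio < 3:
        return "two-sample t-test with a pooled sample variance"
    return "two-sample t-test with unequal variances"


# Variances from parts (a) and (c); the sample sizes are hypothetical.
print(choose_two_sample_test(False, 12, 15, 28.2, 23.8))
print(choose_two_sample_test(False, 12, 30, 9.1, 1.7))
```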


Solution to Exercise 10 
(a) The data give: 
$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} = 19378 - \frac{(620)^2}{30} \approx 6564.7,$$
$$\sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} = 344742 - \frac{(2776)^2}{30} \approx 87869.5,$$
$$\sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n} = 79136 - \frac{620 \times 2776}{30} \approx 21765.3.$$
Hence, 
$$\text{correlation} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \times \sum (y - \bar{y})^2}} = \frac{21765.3}{\sqrt{6564.7 \times 87869.5}} \approx 0.91.$$


(b) The correlation coefficient is quite large, so a straight line should be 
useful for capturing the relationship between time at the checkout and 
number of purchases. However, the relationship might be non-linear, 
in which case there would be better ways of capturing the relationship. 
The data on each individual customer are needed, so that a scatterplot 
of y against x could be drawn. That would enable the relationship 
between the variables to be seen much more clearly, and the question of 
whether a straight-line relationship is appropriate could then be 
answered with greater confidence. 
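
The correlation in part (a) needs only the five summary sums and the sample size, so it can be recomputed directly. The following sketch uses the sums from the solution; the variable names are our own.

```python
from math import sqrt

# Summary values used in the solution.
n = 30
sum_x, sum_y = 620, 2776
sum_x2, sum_y2, sum_xy = 19378, 344742, 79136

# Sums of squared deviations and of cross-products.
sxx = sum_x2 - sum_x**2 / n        # sum of (x - xbar)^2
syy = sum_y2 - sum_y**2 / n        # sum of (y - ybar)^2
sxy = sum_xy - sum_x * sum_y / n   # sum of (x - xbar)(y - ybar)

correlation = sxy / sqrt(sxx * syy)

print(round(sxx, 1), round(syy, 1), round(sxy, 1))
print(round(correlation, 2))
```

As part (b) points out, a calculation like this cannot show whether the relationship is linear; only the individual data values, plotted as a scatterplot, can do that.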


Solution to Exercise 11 


From Exercise 10, 


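$$\sum x = 620, \quad \sum y = 2776, \quad \text{and} \quad n = 30.$$


So 
$$\bar{x} = \frac{\sum x}{n} = \frac{620}{30} \approx 20.667$$
and 
$$\bar{y} = \frac{\sum y}{n} = \frac{2776}{30} \approx 92.533.$$


Also, from the solution to Exercise 10, 
$$\sum (x - \bar{x})^2 \approx 6564.7 \quad \text{and} \quad \sum (x - \bar{x})(y - \bar{y}) \approx 21765.3.$$


Hence, the slope is 
$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{21765.3}{6564.7} \approx 3.3155,$$
and the intercept is 
$$a = \bar{y} - b\bar{x} = 92.533 - 3.3155 \times 20.667 = 92.533 - 68.521 \approx 24.012.$$


These give the regression line: 
$$y = 24.0 + 3.32x.$$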


The intercept of 24.0 implies the basic part of serving a customer (taking 
money and giving change, etc.) takes 24.0 seconds, on average. The slope 
of 3.32 means that each item takes 3.32 seconds to process, on average. 
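
The slope and intercept can be recomputed from the same summary values, and the fitted line can then be used for prediction. This is a sketch only: the helper `predicted_time` and the choice of 10 items are hypothetical illustrations, and any prediction relies on the straight-line relationship being appropriate.

```python
# Summary values from Exercise 10.
n = 30
sum_x, sum_y = 620, 2776
sum_x2, sum_xy = 19378, 79136

mean_x = sum_x / n
mean_y = sum_y / n

sxx = sum_x2 - sum_x**2 / n        # sum of (x - xbar)^2
sxy = sum_xy - sum_x * sum_y / n   # sum of (x - xbar)(y - ybar)

slope = sxy / sxx                      # b
intercept = mean_y - slope * mean_x    # a


def predicted_time(items):
    """Predicted checkout time in seconds for a given number of items
    (hypothetical helper, using the fitted line y = a + bx)."""
    return intercept + slope * items


print(round(intercept, 1), round(slope, 2))
print(round(predicted_time(10), 1))
```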